Abstract - The aim of this paper is to analyze the dynamic evolution of a Virtual Private Network. The
network is modeled as a system controlled by a manager who must take appropriate decisions. However,
to take the best possible decisions, the manager should also be able to forecast the worst behavior, in the
sense of a quality of service criterion, of the system he wants to control. We have chosen to model this
problem as an iterative two-sided game. On one side, the operator tries to reserve the minimal amounts
of bandwidth that guarantee the best possible quality of communication for its various clients. On the
other side, the traffic of the clients follows the worst behavior in the face of the reserved bandwidths.

The theory of Markov decision processes (MDP) enables us to model the uncertainty associated with the
knowledge of the traffic. Besides, two levels should be differentiated in our system. At the local level,
the clients evolve selfishly and independently of one another, choosing the worst possible traffic
evolution; at this level, the manager can reserve bandwidth locally, on each link, for every Virtual
Private Network. At the global level of the links, decisions should be taken by the manager to control
the network centrally.

A hierarchical MDP approach and the stochastic game framework are introduced to propose solutions
to this difficult problem. Furthermore, we study the asymptotic behavior of the system, and prove the
convergence towards stationary strategies. In the final section, we introduce parametrized strategies,
whose parameters are estimated with the help of simulation. Indeed, simulation-based optimization over
the policy space provides an alternative to Bellman's principle, all the more interesting as this
principle becomes hard to apply when the cardinality of the state space increases.

Keywords: Hose model, Markov Decision Process, Bellman's optimality principle, stochastic Games, Cross-Entropy 
method 



1 Introduction 



During the last decades, many methods have been developed to tackle the rather hard problem of traffic matrix
estimation. Our purpose in this article is not to develop a new method for traffic matrix estimation, but rather
to consider the problem from a system-oriented point of view. Indeed, our system is made of a telecommunication
network of nodes and directed links. The operator, or network manager, has the possibility to act on
the bandwidth reservation, in view of the evolution of the traffic going through the whole network. We assume
that, at each global bandwidth allocation, the traffic evolves following the worst configuration in the sense of a
Quality of Service (QoS) criterion. The network operator should be able to forecast the worst possible evolution
of the traffic, and to propose solutions so as to drive the network in an optimal way. In the context of Virtual
Private Networks, guaranteeing an admissible QoS, via reserved bandwidths, loss, and delay characteristics, is
a crucial task for the network manager.

Virtual Private Networks (VPNs) are networks built between geographically distant IP sites of a firm. With the
help of this technology, distant sites of the same firm are able to communicate via secured tunnels. Indeed, the
data are transmitted via the Internet, which is a public infrastructure shared by many operators. In order to
guarantee the security of the client, the data are encrypted and sent along virtual tunnels using MPLS
technology. Besides, a Service Level Agreement (SLA) contract is signed between the network provider
and its client. The aim of this contract is to specify bounds on admissible levels of QoS. As a result, the manager
should be able to forecast both the spatial and the temporal evolution of its traffic.

Traditionally, traffic matrices are used to solve such problems. Nevertheless, their accuracy relies mainly on the
quality of the estimator itself and of the data, which can be quite hazardous. The solution we have chosen to
get a rough characterization of the traffic is to use the hose model, introduced for the first time in [1].

The client is asked to merely specify: 

-the amount of traffic going in/out each of its web sites, 

-the relationships between all its web points (source destination). 



Note that, although the hose model is quite simple to specify from the client's point of view, it
is full of uncertainty for the manager. Indeed, for each source node, for example, the operator does not know how
the traffic is shared between the different destination nodes, which constitutes in itself a spatial uncertainty.
Furthermore, due to the roughness of this approach, he does not know how the traffic should evolve under this
assumption. Consequently, we have chosen to model the dynamic evolution of the traffic as a Markov decision
process (MDP), which enables us to introduce uncertainty in our model.

Index of the main notations, used extensively throughout the article.

- $\{X^{(t)}\}_{t \in \mathbb{N}}$ - discrete time, discrete state space stochastic process modeling the traffic in the Virtual Private Network 1.

- $X_{ij}^{(t)}$ - traffic going from the node i to the node j, at the decision epoch t.

- $S$ - generic state space.

- traffic on the MPLS network links.

- $\mathcal{N}$ - set of the sites, or nodes of the MPLS network.

- $\mathcal{L}$ - set of the links of the MPLS network.

- $t_1^{out}$ - amount of traffic leaving the site 1 of the VPN1.

- $t_2^{out}$ - amount of traffic leaving the site 2 of the VPN1.
Figure 1: The hose model.

As an example, we consider the hose model applied to a small network. The firm is composed of 5 different IP sites,
which are supposed to be geographically distant. For the site 1, which is supposed to be the head office of the firm, the client
gives the operator the connections to the other areas: $1 \leftrightarrow 2$, $1 \leftrightarrow 3$, $1 \leftrightarrow 5$, where the symbol $\leftrightarrow$ means that there is a
potential bidirectional connection between the two sites. Besides, the client gives the volume of traffic going out of the
site 1 and, possibly, the amount of traffic going into the site 1.

- $t_3^{out}$ - amount of traffic leaving the site 3 of the VPN1.

- $R(t)$ - routing matrix at the instant t. We note $R$ if the routing is stable, or time invariant.

- $S^X$ - discrete state space associated with the Markov Decision Process $\{X^{(t)}\}_{t \in \mathbb{N}}$.

- $S^Y$ - discrete state space associated with the Markov Decision Process $\{Y^{(t)}\}_{t \in \mathbb{N}}$.

- $S^Z$ - discrete state space associated with the Markov Decision Process $\{Z^{(t)}\}_{t \in \mathbb{N}}$.

- $A$ - action space associated with the Markov Decision Process (MDP) $\{X^{(t)}\}_{t \in \mathbb{N}}$.

- $F_t(s)$, $s \in S$ - strategy vector associated with the MDP $X^{(t)}$, at the decision epoch t, and for each state $s \in S$.

- $f_t(s, a)$ - probability for the MDP to choose the action a, in the state $s \in S$, at the decision epoch t.

- $V_t(s)$ - value function at the time instant t, in the state $s \in S$.

- $A = [a_X \; a_Y \; a_Z]$ - vectors of actions taken at the local level, i.e. on each VPN network.

- $a_X = [a_{X_{12}} \; a_{X_{21}} \; a_{X_{31}}]$ - actions taken on each link of the VPN1.

- $D = [d_1 \; d_2 \; \dots \; d_{|\mathcal{L}|}]$ - actions taken on each link of the whole MPLS network.

- $(B_{ij}^{X})^{(t)}$ - reserved amount of bandwidth on the directed link $(i,j)$ of the VPN1, at time t.

- $(B_{ij}^{Y})^{(t)}$ - reserved amount of bandwidth on the directed link $(i,j)$ of the VPN2, at time t.

- $(B_{ij}^{Z})^{(t)}$ - reserved amount of bandwidth on the directed link $(i,j)$ of the VPN3, at time t.

- $(B_{l})^{(t)}$ - reserved amount of bandwidth on the directed link $l$ of the global MPLS network, at time t.

- $p_{ij}^{X}$ - price associated with the variation of reserved bandwidth on the link $(i,j)$ of the VPN1.

- $p_{ij}^{Y}$ - price associated with the variation of reserved bandwidth on the link $(i,j)$ of the VPN2.

- $p_{ij}^{Z}$ - price associated with the variation of reserved bandwidth on the link $(i,j)$ of the VPN3.
- $Satis_X$ - satisfaction level for the VPN1.

- $Satis_Y$ - satisfaction level for the VPN2.

- $Satis_Z$ - satisfaction level for the VPN3.

- $S^X \times S^Y \times S^Z$ - state space of the global process $(X^{(t)}, Y^{(t)}, Z^{(t)})$.

- $A_g$ - action space of the global process $(X^{(t)}, Y^{(t)}, Z^{(t)})$.

- $E_2$ - subset of the global state space $S^X \times S^Y \times S^Z$, where every state violates at least one satisfaction bound.

- $E_1 = S^X \times S^Y \times S^Z - E_2$.



2 The representation of Traffic as an MDP 

Let $\{X^{(t)}\}_{t \in \mathbb{N}}$ be the discrete time, discrete state space stochastic process representing the traffic in the network. The
VPN network will be represented by an oriented graph $G = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ is the set of nodes modeling the sites of
the network, and $\mathcal{L}$ is the set of directed links of the VPN.

Let $X_{ij}^{(t)}$ denote the traffic going from the node i to the node j, at the instant t.

At time period t, the whole traffic is represented by a vector $X^{(t)}$.

The traffic on the links is obtained by applying the routing matrix $R(t)$, which can remain constant or change with
time, to the vector $X^{(t)}$. The notation $X_l^{(t)}$, $l \in \mathcal{L}$, represents the traffic flowing through the link $l$.
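As an illustration of the relation between end-to-end traffic and link traffic, the following minimal sketch applies a routing matrix to a traffic vector. It assumes the common convention in which each row of the routing matrix corresponds to a link and each column to an origin-destination pair; the function name `link_traffic` and the small two-link example are hypothetical and are not taken from the article.

```python
def link_traffic(routing, traffic):
    """Return the load on each link.

    routing[l][k] is the fraction of origin-destination flow k carried by
    link l (assumed convention); traffic[k] is the volume of flow k.
    """
    return [sum(r_lk * x_k for r_lk, x_k in zip(row, traffic)) for row in routing]


# Hypothetical example: flows 1->2, 1->3 and 2->3, where the flow 1->3 is
# routed through node 2 and therefore uses both links (1,2) and (2,3).
routing = [
    [1, 1, 0],  # link (1,2) carries flows 1->2 and 1->3
    [0, 1, 1],  # link (2,3) carries flows 1->3 and 2->3
]
traffic = [4, 1, 2]
loads = link_traffic(routing, traffic)
```

With a stable mono-path routing, as in the three-site example studied later, the routing matrix reduces to the identity and the link loads coincide with the flows themselves.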

The state space is defined using a simplified version of the hose model. Indeed, the client gives a rather tight upper bound
on the traffic going out of each node. As we are supposed to be in the worst case, we should assume that this bound is
reached. As an example, in the three-node case, we get a system of relationships:

$$\begin{cases} X_{12}^{(t)} + X_{13}^{(t)} = t_1^{out}, \\ X_{21}^{(t)} + X_{23}^{(t)} = t_2^{out}, \\ X_{31}^{(t)} + X_{32}^{(t)} = t_3^{out}, \\ X_{ij}^{(t)} \ge 0, \quad \forall i,j \in \{1,2,3\}. \end{cases} \qquad (3)$$



We just need to deal with the 3 components $X_{12}^{(t)}$, $X_{21}^{(t)}$ and $X_{31}^{(t)}$, since the others are deduced from the first. The state
space is represented geometrically as the union of the three independent segments defined by the system (3). Consequently,
we will note the continuous state space under the form:

$$S^{continuous} = S_1^{continuous} \times S_2^{continuous} \times S_3^{continuous}.$$

Every element of $S^{continuous}$ can be represented as a 3-dimensional vector $s = (s_1, s_2, s_3)$, where $s_i$ takes
its values in the state space $S_i^{continuous}$, $i = 1, 2, 3$.

To be more explicit, $S_1^{continuous} = \{ (X_{12}^{(t)}, X_{13}^{(t)}) \mid X_{12}^{(t)} + X_{13}^{(t)} = t_1^{out} \}$ represents the continuous state space associated with
the stochastic process $\{X_{12}^{(t)}\}_t$. $S_2^{continuous}$ and $S_3^{continuous}$ define the continuous state spaces associated with the processes
$\{X_{21}^{(t)}\}_t$ and $\{X_{31}^{(t)}\}_t$, respectively. In order to get a discrete state space, the operator should fix a reliability parameter
$a > 0$, which characterizes the accuracy with which he desires to know the traffic flowing through its links. Then,
each of the 3 segments is discretized using the parameter $a$. The resulting discrete state space will be noted
$S = S_1 \times S_2 \times S_3$.
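The discretization can be sketched as follows. This is only an illustration under two assumptions that the text does not state explicitly: the parameter `a` is interpreted as the grid step along each segment, and each bound `t_out` is a multiple of `a`. The function names are hypothetical.

```python
from itertools import product


def discretize_segment(t_out, a):
    """Grid of states (x_first, x_second) with x_first + x_second = t_out.

    Assumes the accuracy parameter a is the grid step and divides t_out.
    """
    n_steps = int(round(t_out / a))
    return [(k * a, t_out - k * a) for k in range(n_steps + 1)]


def build_state_space(t_outs, a):
    """Discrete product state space S = S_1 x S_2 x S_3."""
    segments = [discretize_segment(t_out, a) for t_out in t_outs]
    return list(product(*segments))


segment = discretize_segment(10.0, 2.5)
states = build_state_space([10.0, 10.0, 10.0], 2.5)
```

The cardinality of the product space grows as the product of the segment sizes, which is why a fine accuracy parameter quickly makes exact dynamic programming expensive.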
Figure 2: Discrete state space and action space.



The action space is reduced to 3 distinct motions, that we will note: $A = \{a_0; a_1; a_2\}$. Let us describe the nature
of these actions.

• If we choose $a_0$, the process stays in the same state.

• If we choose $a_1$, the traffic increases with an uncertainty on the state transition. Indeed, we suppose that if the
process is in the state $s_i \in S_j$, $j = 1, 2, 3$, at time t, then it will jump up to one of the three adjacent states, according to an
exponential distribution, decreasing with the distance between these two states. Using the numbering given in Figure 2,
we simulate a normalized ordered sample of the three transition probabilities $[p(s_{i-1}|s_i,a_1), p(s_{i-2}|s_i,a_1), p(s_{i-3}|s_i,a_1)]$.
More explicitly,

$$p(s_{i-k}|s_i, a_1) \sim \mathcal{E}(\lambda_1), \quad \lambda_1 > 0, \quad k = 1, 2, 3,$$
$$p(s_{i-1}|s_i,a_1) > p(s_{i-2}|s_i,a_1) > p(s_{i-3}|s_i, a_1),$$

under the normalizing constraint: $\sum_{k=1}^{3} p(s_{i-k}|s_i, a_1) = 1$.

$\mathcal{E}(\lambda_1)$ symbolizes the exponential distribution of parameter $\lambda_1 > 0$:

$$X \sim \mathcal{E}(\lambda_1) \Leftrightarrow f(x;\lambda_1) = \lambda_1 \exp(-\lambda_1 x) \, 1_{\mathbb{R}^+}(x).$$

We notice that if $i = 3$, the traffic can only jump to one of the two adjacent states. Consequently, we get the rules:

$$p(s_{i-k}|s_i,a_1) \sim \mathcal{E}(\lambda_1), \quad \lambda_1 > 0, \quad k = 1, 2,$$
$$p(s_{i-1}|s_i,a_1) > p(s_{i-2}|s_i,a_1),$$
$$\sum_{k=1}^{2} p(s_{i-k}|s_i,a_1) = 1.$$

If $i = 2$, there is only one possible transition, $p(s_{i-1}|s_i, a_1) = 1$, and finally if $i = 1$, the traffic has no choice but to
stay in the state where it is. This phenomenon results from the finite nature of the state space.



• Finally, if we choose $a_2$, the traffic jumps down to one of the three adjacent states. Formally, we set:

$$p(s_{i+k}|s_i,a_2) \sim \mathcal{E}(\lambda_2), \quad \lambda_2 > 0, \quad k = 1, 2, 3,$$
$$p(s_{i+1}|s_i,a_2) > p(s_{i+2}|s_i,a_2) > p(s_{i+3}|s_i, a_2),$$

and we still have a normalizing constraint of the form: $\sum_{k=1}^{3} p(s_{i+k}|s_i, a_2) = 1$.

$\mathcal{E}(\lambda_2)$ symbolizes the exponential distribution of parameter $\lambda_2 > 0$:

$$X \sim \mathcal{E}(\lambda_2) \Leftrightarrow f(x;\lambda_2) = \lambda_2 \exp(-\lambda_2 x) \, 1_{\mathbb{R}^+}(x).$$

If $i > (|S_j| - 3)$, we get the same limitations on the transitions as previously mentioned.
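The construction of the transition probabilities can be sketched as below. The sketch follows one possible reading of the text: the weights are drawn from an exponential distribution, sorted in decreasing order so that the nearest state is the most likely, then normalized. States are indexed from 1 to `n_states`, action 1 moves towards lower indices and action 2 towards higher indices, as in the formulas above. The function names and the use of a seeded generator are assumptions made for the illustration.

```python
import random


def sample_jump_probabilities(rate, n_targets, rng):
    """Draw n_targets exponential weights, order them decreasingly, normalize."""
    draws = sorted((rng.expovariate(rate) for _ in range(n_targets)), reverse=True)
    total = sum(draws)
    return [d / total for d in draws]


def transition_row(i, n_states, action, rate, rng):
    """Return p(.|s_i, action) as a dict {target index: probability}.

    action 0: stay; action 1: jump to s_{i-1}, s_{i-2}, s_{i-3};
    action 2: jump to s_{i+1}, s_{i+2}, s_{i+3}. Targets falling outside
    1..n_states are dropped, which reproduces the boundary limitations.
    """
    if action == 0:
        return {i: 1.0}
    step = -1 if action == 1 else 1
    targets = [i + step * k for k in (1, 2, 3) if 1 <= i + step * k <= n_states]
    if not targets:
        return {i: 1.0}  # no adjacent state available: the traffic stays put
    probs = sample_jump_probabilities(rate, len(targets), rng)
    return dict(zip(targets, probs))


rng = random.Random(0)
row = transition_row(5, 10, 1, 2.0, rng)
```

Because the weights are sorted before being assigned, the nearest target always receives the largest probability, which matches the ordering constraint.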



2.1 Iterative game between bandwidth reservation and traffic allocation 

Recall that a strategy specifies, for each state $s \in S$ and each time t, the probability of choosing each of the three actions.
In vector form, we obtain:

$$\forall t \in \mathbb{N}, \ \forall s \in S, \quad F_t(s) = \left( f_t(s,a_0) \;\; f_t(s,a_1) \;\; f_t(s,a_2) \right)^T.$$

However, this probability vector is stochastic and, consequently, must satisfy the following normalization and
non-negativity constraints:

$$\sum_{a \in A} f_t(s,a) = 1, \quad \forall t \in \mathbb{N}, \ \forall s \in S,$$
$$f_t(s,a) \ge 0, \quad \forall t \in \mathbb{N}, \ \forall s \in S, \ \forall a \in A.$$

For each time period t, the strategy is represented by an associated matrix $F_t$:

$$F_t = \left( F_t(1) \;\; F_t(2) \;\; \dots \;\; F_t(N) \right) = \begin{pmatrix} f_t(1,a_0) & f_t(2,a_0) & \dots & f_t(N,a_0) \\ f_t(1,a_1) & f_t(2,a_1) & \dots & f_t(N,a_1) \\ f_t(1,a_2) & f_t(2,a_2) & \dots & f_t(N,a_2) \end{pmatrix}.$$
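The two constraints on a strategy matrix can be checked with a few lines of code. The sketch below stores the matrix as a list of three rows (one per action) and N columns (one per state), mirroring the layout above; the function names and the tolerance are assumptions made for the illustration.

```python
def is_stochastic(strategy, tol=1e-9):
    """Check that every column of the strategy matrix is a probability vector."""
    n_states = len(strategy[0])
    for s in range(n_states):
        column = [row[s] for row in strategy]
        if any(p < 0 for p in column):
            return False
        if abs(sum(column) - 1.0) > tol:
            return False
    return True


def is_deterministic(strategy):
    """A pure strategy is stochastic and uses only the probabilities 0 and 1."""
    return is_stochastic(strategy) and all(
        p in (0.0, 1.0) for row in strategy for p in row
    )


mixed = [[1.0, 0.5], [0.0, 0.25], [0.0, 0.25]]
pure = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
invalid = [[0.5, 0.5], [0.0, 0.25], [0.0, 0.25]]
```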



We begin by recalling basic definitions, which may be very useful for a proper understanding of the rest of the article.

Definition 1. A strategy is stationary, if it is invariant with respect to the time, i.e.:

$$\forall t \in \mathbb{N}, \ \forall s \in S, \quad f_t(s,a) = f(s,a), \quad \forall a \in A,$$

and deterministic or pure, if there exists a unique optimal action for each state, at each instant, which means that:

$$\forall s \in S, \quad f_t(s,a) \in \{0;1\}, \quad a \in A.$$

At first, we deal with deterministic strategies only. Furthermore, we suppose that the horizon is finite.

We note $\pi = (F_0, F_1, \dots, F_T)$ the sequence of stationary strategies defined on $[0; T]$.

To begin with, we consider again the simple model of a 3 site network. The sites will be numbered

$$\mathcal{N} = \{1,2,3\},$$

and are associated with nodes. The directed links are stored in the set

$$\mathcal{L} = \{(1,2); (1,3); (2,1); (2,3); (3,1); (3,2)\}.$$

Furthermore, we suppose that the routing is stable, i.e. time invariant, and that between each couple of nodes, the only
possible path is the directed link joining these two nodes.




We have chosen to cope with an objective function modeling the delay on the whole network, which is one fundamental
parameter in the QoS requirements. In fact, due to the simple structure of the example, each link is associated with
an M/M/1 queue, and consequently the global delay on the whole Virtual Private Network takes the form:



X. 
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it) 



B 



(t-i) 



(4) 
where $B_{ij}^{(t)}$ is the bandwidth reserved on the directed link $(i,j)$ at time t by the network manager. The second part
of this equation stands for a penalty criterion. Indeed, in order to minimize the first part of the equation, the operator
would choose to increase the reserved amount of bandwidth without bound. Fortunately, the second part introduces a price
$p_{ij} > 0$, linked with the variations of the reserved bandwidth. Under this assumption, the manager's interest is to
choose relatively stable values for the amounts of reserved bandwidth.

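Formally, our problem reads:

$$\pi^* = \arg\min_{B = (B^{(0)}, B^{(1)}, \dots, B^{(T)})} \ \arg\max_{\pi = (F_0, F_1, \dots, F_T)} \left\{ E_\pi\Big[ \sum_{t=0}^{T} \beta^t \, C_t(X^{(t)}, F_t) \,\Big|\, X^{(0)} = s \Big], \ s \in S \right\}. \qquad (5)$$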



Remark. Since the strategies are deterministic, there exists a unique optimal action at each time instant t, and for
each state in the state space $S$. Consequently, $F_t(X^{(t)})$ contains the optimal action associated
with the random variable $X^{(t)}$, at the instant t. The cost function can then naturally be interpreted as follows:



Ct{X^'\Ft) = ^ 



<^[X^+F,{X^>)]-X 



(6) 



The parameter $\beta \in [0; 1[$, often called discount factor, captures the natural notion that a reward of 1 unit at time
$(t + 1)$ is worth only $\beta$ of what it was worth at time t.

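In order to simplify the expression of $C_t$, a quite natural idea is to split the sum into two parts. Hence, following
our intuition, we write:

$$\pi^* = \arg\min_{B} \ \arg\max_{\pi = (F_0, F_1, \dots, F_T)} \left\{ E_\pi\big[ C_0(X^{(0)}, F_0) \,\big|\, X^{(0)} = s \big] + E_\pi\Big[ \sum_{t=1}^{T} \beta^t \, C_t(X^{(t)}, F_t) \,\Big|\, X^{(0)} = s \Big] \right\}.$$

Then, it follows easily that

$$\pi^* = \arg\min_{B} \left\{ \arg\max_{F_0} \big[ C_0(s, F_0) \big] + \arg\max_{F_1, F_2, \dots, F_T} \Big( E_\pi\Big[ \sum_{t=1}^{T} \beta^t \, C_t(X^{(t)}, F_t) \,\Big|\, X^{(0)} = s \Big] \Big) \right\}.$$

If we repeat the same decomposition once more, we get the expression:

$$\pi^* = \arg\min_{B = (B^{(0)}, B^{(1)}, \dots, B^{(T)})} \left\{ \arg\max_{F_0} \big[ C_0(s, F_0) \big] + \arg\max_{F_1} \sum_{s'} \big[ \beta \, C_1(s', F_1) \, p(s'|s, F_0) \big] + \arg\max_{F_2, \dots, F_T} E_\pi\Big[ \sum_{t=2}^{T} \beta^t \, C_t(X^{(t)}, F_t) \,\Big|\, X^{(0)} = s \Big] \right\}. \qquad (7)$$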

Equation (7) captures the essence of the principle of optimality, which is based on the recursive nature of the
equation and the introduction of the value function $V_t(s)$, $s \in S$.

In fact, solving (5) is equivalent to computing a solution to Bellman's optimality equation, which takes the following
special form:

$$V_t(s) = \min \max \Big\{ C_{T-t}(s, a) + \beta \sum_{s'=1}^{|S|} p(s'|s, a) \, V_{t-1}(s') \Big\}, \quad \forall s \in S, \ \forall t \in \{0,1,2,\dots,T\}. \qquad (8)$$
To solve this equation, we proceed by backward induction.
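The recursion can be sketched in a few lines. The sketch below is a simplified illustration, not the authors' implementation: it assumes that the minimization over the reserved bandwidth has already been carried out inside the stage cost (as the text does when it substitutes the analytical solution), so that only the maximization over the traffic actions remains, and it takes the value with zero stages to go to be zero. The names `backward_induction`, `cost` and `trans` are hypothetical.

```python
def backward_induction(states, actions, horizon, cost, trans, beta):
    """Finite-horizon backward induction with t counting the stages to go.

    cost(k, s, a): stage cost at decision epoch k (bandwidth already optimized).
    trans(s, a): dict {next state: probability}.
    Returns (values, policy), where values[t][s] is the value with t stages
    to go and policy[t][s] the maximizing action.
    """
    values = [{s: 0.0 for s in states}]  # assumption: nothing left to pay at the end
    policy = [None]
    for t in range(1, horizon + 1):
        v_t, f_t = {}, {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                expected = sum(p * values[t - 1][s2] for s2, p in trans(s, a).items())
                q = cost(horizon - t, s, a) + beta * expected
                if q > best_q:
                    best_a, best_q = a, q
            v_t[s], f_t[s] = best_q, best_a
        values.append(v_t)
        policy.append(f_t)
    return values, policy


# Toy example: two states, action 0 stays, action 1 switches state,
# and the stage cost is 1 in state 1 and 0 in state 0.
def toy_trans(s, a):
    return {s: 1.0} if a == 0 else {1 - s: 1.0}


def toy_cost(k, s, a):
    return float(s)


values, policy = backward_induction([0, 1], [0, 1], 2, toy_cost, toy_trans, 0.5)
```

In the toy example, the adversarial traffic learns to move to (or remain in) the costly state, which is the worst-case behavior the manager wants to anticipate.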



★ To begin with, we suppose that at $t = T$, the reserved bandwidth is fixed. Then, in each state s, we have to find the
set of actions on each link which maximizes the equation:



-,T-1 



are max 

aGAlsl 



E 



(9) 



The traffic being fixed to its new value: X'"^' = X'"^ + a'^ we would like to find the minimal amount of 
bandwidth to be reserved on each link. Consequently, we must solve the optimization problem: 



arg mm 



E 



X 



(T) 



Bi 



X. 



(T) 



' Pi] (Bi^ 



(10) 



The solution of this continuous optimization problem can be obtained analytically. We therefore express it as a
function of the worst traffic allocation at time T.



Finally, substituting (|1 ip in the equation we get the simpler expression: 



,T-1 



arg max 



E 



(T-l) , 



(T-1) 



+ ay) - {x 



(T-l) 



(11) 



(12) 



where, a;'^ is a realization of the traffic process in the state space S. 

Now, the value can be easily computed, and we set: 

Vi{s) = CTMs,aJ-^), Vs6S. 

★ Then, at the iteration $(T - t)$, $t > 1$, we proceed in exactly the same way. An optimal action and the associated
optimal rewards are known for the last $(t - 1)$ stages. Then, with t stages to go, the only thing we need to do is to
maximize the sum of the immediate expected reward and the maximal expected payoff for the remainder of the process with
$(t - 1)$ stages to go. As a result, we obtain the expression:



are max | > 

|Si| 



(<i>(a 



(13) 



And, the value takes the form: 



Vt{s) 



{E 



(T-t) . {T-i) 



(T-t) {T-t)' 



(T-t) ^JT-t)^ 



+ ft,(<i>(4^-(*-^»)-ci>(x. 



(T-t) 



|Si| 



(14) 



We go back with the same idea, until t = T. 
3 Application to the management of a 3 site - VPN



3.1 Stable and mono-path routing 



To model this simple system, we introduce 3 MDPs, called respectively $X_1^{(t)} = (X_{12}^{(t)}, X_{13}^{(t)})$, $X_2^{(t)} = (X_{21}^{(t)}, X_{23}^{(t)})$ and
$X_3^{(t)} = (X_{31}^{(t)}, X_{32}^{(t)})$. In fact, these MDPs are not really bi-dimensional, since we only need to take into account the first
components. Indeed, the second ones are deduced from the first ones, using the set of equalities (3). Since the routing is
supposed to be constant, we note that these three MDPs are completely independent of one another. Consequently, our
initial problem can be separated into three disjoint sub-problems.



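$$\begin{cases} (\pi^{X_1})^* = \arg\min_{B^{X_1}} \ \arg\max_{\pi^{X_1}} \left\{ E_{\pi^{X_1}}\Big[ \sum_{t=0}^{T} \beta^t \, C_t(X_1^{(t)}, F_t) \,\Big|\, X_1^{(0)} = s \Big], \ s \in S_1 \right\}, \\ (\pi^{X_2})^* = \arg\min_{B^{X_2}} \ \arg\max_{\pi^{X_2}} \left\{ E_{\pi^{X_2}}\Big[ \sum_{t=0}^{T} \beta^t \, C_t(X_2^{(t)}, F_t) \,\Big|\, X_2^{(0)} = s \Big], \ s \in S_2 \right\}, \\ (\pi^{X_3})^* = \arg\min_{B^{X_3}} \ \arg\max_{\pi^{X_3}} \left\{ E_{\pi^{X_3}}\Big[ \sum_{t=0}^{T} \beta^t \, C_t(X_3^{(t)}, F_t) \,\Big|\, X_3^{(0)} = s \Big], \ s \in S_3 \right\}. \end{cases} \qquad (15)$$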



To solve these equations, we simply apply Bellman's optimality equation, and backward induction, as explained in 
the previous section. 
3.2 Existence of stationary strategies

The notion of stability is fundamental in the theory of dynamical systems. The usual idea is to prove the convergence of
the system towards an equilibrium point. Transposed to the theory of Markov chains, equilibrium points are associated with
invariant measures. Finally, in the theory of MDPs, these invariant measures become stationary strategies.
We will start by computing the stationary strategies using the elegant approach developed in more detail in [5].

Let $V$ be an arbitrary vector defined on the state space S. Using the definition of the optimality equation, we
know that in each state $s \in S$, the value function should satisfy the inequality:

$$V(s) \ge C(s, a) + \beta \sum_{s'=1}^{|S|} p(s'|s, a) \, V(s'), \quad \forall a \in A, \ \forall s \in S. \qquad (16)$$

Remark. The function C is actually independent of the time, since the strategy should be stationary.
If we multiply the above inequalities by $f(s, a)$, and sum over all the $a \in A$, we get:

$$V(s) \ge C(s, F) + \beta \sum_{s'=1}^{|S|} p(s'|s, F) \, V(s'), \quad \forall s \in S,$$

where $C(s,F) = \sum_{a \in A} C(s,a) f(s,a)$ and $p(s'|s,F) = \sum_{a \in A} p(s'|s,a) f(s,a)$.
Figure 6: Sojourn times for the MDPs $X_{12}$, $X_{21}$, $X_{31}$. The sojourn time associated
with a specific couple of state and action $(s,a)$, $s \in S$, $a \in A$, represents the number of times, among the
decision epochs $\{0, 1, \dots, T\}$, at which we choose the action a in the state s. It enables us to characterize the
standard behavior of the system.

Figure 7: Evolution of the sojourn times. The distribution is more homogeneous, since asymptotically the
choice of the action in each state is time invariant.



In matrix form, the set of inequalities becomes:

$$V \ge C(F) + \beta P(F) V,$$

where the reward function under the strategy F can be written in vector form:

$$C(F) = \left( C(1,F), \ C(2,F), \ \dots, \ C(|S|,F) \right)^T,$$

and the transition probability matrix becomes

$$P(F) = \begin{pmatrix} p(1 \mid 1,F) & p(2 \mid 1,F) & \dots & p(|S| \mid 1,F) \\ p(1 \mid 2,F) & p(2 \mid 2,F) & \dots & p(|S| \mid 2,F) \\ \vdots & \vdots & & \vdots \\ p(1 \mid |S|,F) & p(2 \mid |S|,F) & \dots & p(|S| \mid |S|,F) \end{pmatrix}.$$

Upon substituting the above inequality into itself k times, and taking the limit as $k \to \infty$, we obtain:

$$V \ge [I - \beta P(F)]^{-1} C(F).$$
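The repeated substitution can be mimicked numerically: iterating the map $V \mapsto C(F) + \beta P(F) V$ from the zero vector accumulates the partial sums of the series $\sum_k \beta^k P(F)^k C(F)$, which converges to $[I - \beta P(F)]^{-1} C(F)$ when $\beta < 1$. The sketch below is only an illustration with a hypothetical two-state chain; the function name and the number of iterations are assumptions.

```python
def discounted_value(P, C, beta, n_iter=200):
    """Approximate [I - beta P]^{-1} C by repeated substitution.

    P: transition matrix under a fixed stationary strategy (list of rows).
    C: cost vector under the same strategy.
    """
    n = len(C)
    V = [0.0] * n
    for _ in range(n_iter):
        V = [C[s] + beta * sum(P[s][s2] * V[s2] for s2 in range(n)) for s in range(n)]
    return V


# Hypothetical chain that alternates between two states, paying 1 in state 0.
P = [[0.0, 1.0], [1.0, 0.0]]
C = [1.0, 0.0]
V = discounted_value(P, C, 0.5)
```

For this chain the limit can be checked by hand: $V(0) = 1 + 0.5\,V(1)$ and $V(1) = 0.5\,V(0)$ give $V(0) = 4/3$ and $V(1) = 2/3$.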
But,



t=0 



t=0 



since the transition probabilities are time homogeneous. 
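Going back to our matrix formulation, we get:

$$[I - \beta P(F)]^{-1} C(F) = \begin{pmatrix} \sum_{t=0}^{\infty} \beta^t \, E_F\big[ C(X^{(t)}, F) \,\big|\, X^{(0)} = 1 \big] \\ \vdots \\ \sum_{t=0}^{\infty} \beta^t \, E_F\big[ C(X^{(t)}, F) \,\big|\, X^{(0)} = |S| \big] \end{pmatrix}.$$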

We see that an arbitrary vector V satisfying (16) is an upper bound on the discounted value vector due to any stationary
strategy F. Consequently, it is natural to expect the discounted value vector to be the optimal solution of the
linear program:

$$\min \sum_{s \in S} \gamma(s) V(s)$$
$$V(s) \ge C(s, a) + \beta \sum_{s'=1}^{|S|} p(s'|s, a) \, V(s'), \quad a \in A, \ s \in S,$$

with $\gamma(s)$ ($\gamma(s) > 0$ and $\sum_{s \in S} \gamma(s) = 1$) being the probability that the process begins in state $s \in S$.
By duality, we get:

$$\begin{cases} \max \sum_{s=1}^{|S|} \sum_{a=1}^{3} C(s,a) \, x_{sa} \\ \sum_{s=1}^{|S|} \sum_{a=1}^{3} \big[ \delta(s,s') - \beta \, p(s'|s,a) \big] x_{sa} = \gamma(s'), \quad s' \in S, \\ x_{sa} \ge 0, \quad a \in A, \ s \in S. \end{cases} \qquad (17)$$



$x_{sa}$, $s \in S$, $a \in A$, which is the dual variable, can be heuristically interpreted as the long-run fraction of decision
epochs at which the system is in the state s and the action a is chosen.

Remark. It can be shown (see [5]) that if the variables $x_{sa}$ are obtained using the simplex algorithm, as the solution
of the above linear program, then the associated stationary strategy is indeed a deterministic stationary strategy. We note
$x_s = \sum_{a \in A} x_{sa}$, for each $s \in S$. In our context, we get the following system of equalities:

$$\begin{cases} x_{1a_0} + x_{1a_1} + x_{1a_2} = x_1, \\ x_{2a_0} + x_{2a_1} + x_{2a_2} = x_2, \\ \quad \vdots \\ x_{|S|a_0} + x_{|S|a_1} + x_{|S|a_2} = x_{|S|}, \\ x_{sa} \ge 0, \quad \forall s \in S, \ \forall a \in A. \end{cases} \qquad (18)$$

However, since $x^0 = (x^0_{1a_0}, x^0_{1a_1}, x^0_{1a_2} \mid x^0_{2a_0}, x^0_{2a_1}, x^0_{2a_2} \mid \dots \mid x^0_{|S|a_0}, x^0_{|S|a_1}, x^0_{|S|a_2})$ is an optimal basic feasible solution for
the simplex algorithm, it is necessarily an extreme point of the space defined by the system (18). Using the definition of an
extreme point, for all $s \in S$, there exists a unique $i \in \{0; 1; 2\}$ such that

$$x^0_{sa_i} = x^0_s, \qquad x^0_{sa_j} = 0, \ \text{for } j \in \{0;1;2\}, \ j \ne i.$$

Consequently, the stationary control $F^0$ constructed from $x^0$ by setting

$$f^0(s,a) = \frac{x^0_{sa}}{\sum_{a \in A} x^0_{sa}}, \quad \forall a \in A, \ \forall s \in S,$$

is deterministic.
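The construction of the stationary control from the dual variables is a row-wise normalization, which the following sketch illustrates. The dual solution used in the example is made up for the illustration (it is not the output of an actual simplex run), and the function name is hypothetical.

```python
def strategy_from_dual(x):
    """Build f(s, a) = x[s][a] / sum_a x[s][a] from the dual variables.

    x[s][a] is the dual variable for state s and action a. Every state is
    assumed to carry positive mass, which holds when gamma(s) > 0.
    """
    strategy = []
    for row in x:
        total = sum(row)
        if total <= 0:
            raise ValueError("each state must carry positive mass")
        strategy.append([value / total for value in row])
    return strategy


# Hypothetical extreme-point solution: one positive entry per state.
x0 = [[0.0, 2.5, 0.0], [1.2, 0.0, 0.0]]
f0 = strategy_from_dual(x0)
```

When exactly one entry per state is positive, as at an extreme point, each row of the resulting strategy contains a single 1 and the strategy is pure.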

We have proved the existence of a stationary strategy for our problem. But are we sure that the set of feasible
strategies determined on $[0; T]$ with the help of Bellman's optimality equation converges asymptotically to these values? The
concept of ergodicity is well known in the field of system engineering, and more specifically in queuing theory. The
problem with such systems is to define operational parameters which optimize the performance of the system.
Those parameters are usually deduced from the study of the stationary behavior of the global system. The idea is
to study the dynamic evolution of a given trajectory of the system. But do all these specific realizations adopt the same
asymptotic behavior? Is there a law linking operational and stochastic performance parameters?

Basically, a system is said to be ergodic if all the specific realizations of the dynamic evolution of the system are
asymptotically and statistically the same. In fact, ergodicity is synonymous with equality between spatial and temporal means.
As a result, in such a framework, the operational parameters are equal to the stochastic performance parameters.
Translated to the MDP context, the property of ergodicity is defined conditionally on the choice of an action. The
definition below introduces the notion of ergodicity from a measure-theoretic point of view.

Definition 2. Conditionally on the choice of an action $a \in A$, the Markov chain
$(S, \{p(s'|s, a)\}_{s,s' \in S})$ is ergodic if:

$$\left| P[X^{(t)} \in B \mid X^{(0)} = s] - \mu^*(B) \right| \xrightarrow[t \to \infty]{} 0, \quad \forall s \in S, \ \forall B \in \mathcal{B}(S), \qquad (19)$$

where $\mu^*(B) = \int P[X^{(t+1)} \in B \mid X^{(t)} = x] \, \mu^*(dx)$. Besides, a well known result states that in the case of
an ergodic MDP, the set of strategies $\{f_t(s, a) \mid s \in S, a \in A\}_t$ converges to a set of stationary strategies
$\{f(s, a) \mid s \in S, a \in A\}$, i.e. strategies which are time invariant.

In practice, we would rather use the fundamental result mentioned before, which states that to prove the ergodicity of a Markov
chain, it suffices to establish the equality between temporal and spatial means. Since, conditionally on the choice of
an action, $(S, \{p(s'|s, a)\}_{s,s' \in S})$ is a Markov chain, we are able to transpose this result to the theory of MDPs. Then, for
every $a \in A$, we have to verify that:

(20)

The second part of the equality is obtained using simulation. Indeed, using Bellman's optimality principle, and for T
large enough, we are given a set of optimal deterministic strategies on $[0; T]$: $\pi = (F_0, F_1, \dots, F_T)$. In order to build
sample trajectories, we just have to choose an arbitrary initial state, or even better, an initial distribution on the state space.
Then, we find the optimal action associated with the state at the decision epoch t. As a result of this action, we are
driven to a new state, and we repeat the process until $t = T$.

To compute the first part of the equation, since the first term is positive, we can interchange the sum and the limit. And,
to determine $f(s, a)$, we use linear programming. Indeed, the linear program (17) introduced previously gives us the
values of the optimal parameters $\{x_{sa} \mid s \in S, a \in A\}$, via the simplex algorithm. The strategies obtained using the
normalizing ratio

$$f(s,a) = \frac{x_{sa}}{\sum_{a \in A} x_{sa}}, \quad \forall s \in S, \ \forall a \in A,$$

are stationary and deterministic. Since the empirical and the statistical means coincide asymptotically, we deduce that the
sequence of optimal strategies determined by dynamic programming converges to the stationary strategy obtained using
linear programming.
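The simulation step described above can be sketched as follows: a trajectory is generated by applying, at each decision epoch, the action prescribed by the strategy of that epoch, and the empirical frequency of each state-action couple is then computed. This is only an illustrative sketch; the function names are hypothetical, and the toy chain at the end stands in for the actual VPN model.

```python
import random
from collections import Counter


def simulate(policy, trans, initial_state, rng):
    """Generate one trajectory and count the visits to each (state, action).

    policy: list over decision epochs of dicts {state: action}.
    trans(s, a): dict {next state: probability}.
    """
    state = initial_state
    visits = Counter()
    for f_t in policy:
        action = f_t[state]
        visits[(state, action)] += 1
        targets, probs = zip(*trans(state, action).items())
        state = rng.choices(targets, weights=probs)[0]
    return visits


def empirical_frequencies(visits):
    """Fraction of decision epochs spent in each (state, action) couple."""
    total = sum(visits.values())
    return {couple: count / total for couple, count in visits.items()}


# Toy chain: two states, the prescribed action always switches state.
def toy_trans(s, a):
    return {1 - s: 1.0}


toy_policy = [{0: 1, 1: 1} for _ in range(4)]
freqs = empirical_frequencies(simulate(toy_policy, toy_trans, 0, random.Random(0)))
```

Comparing such empirical frequencies, for increasing T, with the strategy obtained from the linear program is one way to observe the convergence discussed in the text.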



Convergence of the empirical means of the MDPs conditionally on the choice of the actions, t < 100 (panels: X1(t), action 1; X2(t), action 2; X3(t), action 0).

Convergence of the empirical means of the MDPs conditionally on the choice of the actions, t < 300 (panels: X1(t), action 1; X2(t), action 2; X3(t), action 0).



Remark. What happens to this model if we suppose that the routing is multi-path and changing? We notice that
the three MDPs are not independent anymore. Consequently, the state space and the action space are made of all the
combinations of 3 elements taken from the initial state space $S$ and the initial action space $A$, respectively. Let $\mathcal{L}$ be
the set of links of the network, and $R(t)$ the routing matrix at time t. The optimality equation remains unchanged in
form, but the cost function is more complicated.

$$V_t(s) = \min \max \Big\{ C_{T-t}(s, a) + \beta \sum_{s' \in S} p(s'|s, a) \, V_{t-1}(s') \Big\}, \quad \forall s \in S, \ \forall t \in \{0,1,2,\dots,T\}, \qquad (21)$$






where. 



1^1 



- m) x(% 

$p_l > 0$ is the price associated with the link $l$.
Besides, conditionally on the choice of a three-dimensional action, we make the assumption that the transition probabilities
are independent of one another, i.e.:

$$P\big[ (s'_1, s'_2, s'_3) \mid (s_1, s_2, s_3), (a_1, a_2, a_3) \big] = p\big[ s'_1 \mid (s_1, a_1) \big] \, p\big[ s'_2 \mid (s_2, a_2) \big] \, p\big[ s'_3 \mid (s_3, a_3) \big], \quad \forall (s_1, s_2, s_3), (s'_1, s'_2, s'_3) \in S.$$
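Under this independence assumption, the joint transition law is simply the product of the three marginal rows, as the short sketch below illustrates. The function name and the numerical rows are hypothetical and serve only as an example.

```python
from itertools import product


def joint_transition(rows):
    """Combine per-site transition rows into a joint law.

    rows: list of dicts {next state: probability}, one per site, each already
    conditioned on the current state and action of that site.
    Returns a dict {(s1', s2', s3'): probability}.
    """
    joint = {}
    for combo in product(*(row.items() for row in rows)):
        next_states = tuple(state for state, _ in combo)
        probability = 1.0
        for _, p in combo:
            probability *= p
        joint[next_states] = probability
    return joint


rows = [{1: 0.5, 2: 0.5}, {3: 1.0}, {4: 0.25, 5: 0.75}]
joint = joint_transition(rows)
```

The joint state space has as many elements as the product of the three marginal supports, which illustrates why the multi-path case is more expensive to solve than the three independent sub-problems.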



Dynamic evolution of the traffic on the VPN for a changing routing, with exponentially distributed weights (panels: site 1, site 2 and site 3 with X(0)=GlobalState(59); site 2 with X(0)=GlobalState(113)).

Dynamic evolution of the traffic on the VPN for a changing routing, with normally distributed weights (panels: site 1, site 2 and site 3 with X(0)=GlobalState(59); site 2 with X(0)=GlobalState(113)).



4 The need for a centralized control in a MPLS-network 



We have introduced a way to compute the worst dynamic evolution of the traffic on a 3 site-VPN with or without changing
routing. Taking the reflection further, we can ask ourselves whether it is possible to manage a network of at least 3
distinct VPNs.
Each VPN expects the manager to satisfy the level of QoS it has chosen in the SLA. Consequently, the operator
should be able to manage 3 different VPNs sharing the same infrastructure, each having specific requirements on the
quality of service. Furthermore, the VPNs are not aware of one another's presence in the network. In fact, at the local level,
each VPN behaves completely selfishly, insofar as it tries to optimize its own criteria, using the shared bandwidth, without
any knowledge of the needs of the others.

Let $\mathcal{L} = \{1, 2, \dots, 12\}$ be the set of the links of the global network.

$S^l = \{1, 2, \dots, |S^l|\}$, $l \in \mathcal{L}$, is the state space associated with the link $l$ in the MPLS network.

Actually, we define a bound for each VPN, which models the admissible level of delay that the client is able to bear.
We note these levels $Satis_X$, $Satis_Y$ and $Satis_Z$, respectively.

The stochastic process associated with the VPN1 will be noted $X^{(t)} = (X_{12}^{(t)}, X_{21}^{(t)}, X_{31}^{(t)})$.

Recall that $X_{ij}^{(t)}$ stands for the traffic going out of the site i towards the site j of the VPN1, at the instant t. Similarly, the
process associated with the VPN2 is denoted $Y^{(t)} = (Y_{12}^{(t)}, Y_{21}^{(t)}, Y_{31}^{(t)})$.

$Y_{ij}^{(t)}$ represents the traffic flowing from the site i to the site j on the VPN2, at the instant t. Finally, $Z^{(t)} = (Z_{12}^{(t)}, Z_{21}^{(t)}, Z_{31}^{(t)})$
will represent the traffic on the third VPN.

At the local level, the game is played as before: at each time step, the operator makes a bandwidth reservation on the links
of the VPN, and the traffic chooses the worst associated allocation on the VPN's links. At the global level, the decisions are
centralized: the actions are chosen directly on the links of the global MPLS network. As in the local approach,
there are three distinct available actions for each link:

$$D = \{d^0; d^1; d^2\}.$$

• $d^0$ means that the traffic on the link remains unchanged.

• $d^1$ means that the traffic on the link increases, with some uncertainty on the next state it will enter:

$$p(s_{i-j} \mid s_i, d^1) \sim \mathcal{E}(\nu_1), \quad \nu_1 > 0, \quad j = 1, 2, 3,$$
$$p(s_{i-1} \mid s_i, d^1) > p(s_{i-2} \mid s_i, d^1) > p(s_{i-3} \mid s_i, d^1), \quad s_i \in S^l,$$

under the normalizing constraint $\sum_{k=1}^{3} p(s_{i-k} \mid s_i, d^1) = 1$, and with the usual limitations due to the finite cardinality
of the state space.

• $d^2$ means that the traffic on the link decreases, and the laws are of the same type as previously explained:

$$p(s_{i+j} \mid s_i, d^2) \sim \mathcal{E}(\nu_2), \quad \nu_2 > 0, \quad j = 1, 2, 3,$$
$$p(s_{i+1} \mid s_i, d^2) > p(s_{i+2} \mid s_i, d^2) > p(s_{i+3} \mid s_i, d^2), \quad s_i \in S^l,$$

with $\sum_{k=1}^{3} p(s_{i+k} \mid s_i, d^2) = 1$.
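The transition laws attached to the two non-trivial actions can be sketched numerically. The snippet below is only an illustration under stated assumptions: it reads the exponential law as weights proportional to exp(-nu * j) for a jump of size j, renormalized over the admissible targets, and it keeps the state unchanged when no jump is admissible at the boundary of the finite state space. The function name `jump_row` and these boundary conventions are hypothetical and are not taken from the paper.

```python
import math


def jump_row(i, n_states, nu, direction, max_jump=3):
    """Transition row out of state index i for a one-directional action.

    direction=-1 mimics d^1 (targets s_{i-1}, s_{i-2}, s_{i-3}) and
    direction=+1 mimics d^2 (targets s_{i+1}, s_{i+2}, s_{i+3}).
    Assumption: the weight of a jump of size j is proportional to exp(-nu * j).
    """
    targets = [i + direction * j for j in range(1, max_jump + 1)]
    # Finite state space: drop targets outside {0, ..., n_states - 1}.
    targets = [s for s in targets if 0 <= s < n_states]
    row = [0.0] * n_states
    if not targets:
        row[i] = 1.0  # assumed convention: stay put when no jump is admissible
        return row
    weights = [math.exp(-nu * abs(s - i)) for s in targets]
    total = sum(weights)
    for s, w in zip(targets, weights):
        row[s] = w / total  # normalizing constraint: the row sums to one
    return row


row = jump_row(i=5, n_states=10, nu=0.7, direction=-1)
```

With these conventions, the nearest target always receives the largest probability, which matches the ordering imposed on the three jump probabilities.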

The idea now is to define a rule that gives the decision epochs at which the decisions should be taken
centrally, the control being chosen locally the rest of the time. We choose the following
rule:

Rule. If one of the bounds is not satisfied, then the decisions are taken at the global level, i.e. on the links of the whole
network, until all the bounds are satisfied again.




Figure 9: A centrally managed network. 



4.1 A hierarchical finite horizon MDP approach 

In this section, we make the assumption that the horizon is finite. 

We start by defining a global process on the MPLS network: $(X^{(t)}, Y^{(t)}, Z^{(t)})$.

- $X^{(t)} \in S^X$ represents the amount of traffic on VPN1,

- $Y^{(t)} \in S^Y$ is the volume of traffic flowing through VPN2,

- $Z^{(t)} \in S^Z$ is the amount of traffic on VPN3.






- $L^{(t)} = R \left( X^{(t)}, Y^{(t)}, Z^{(t)} \right)^{\top}$ represents the amounts of traffic on the links of the MPLS network. $R$ is the routing matrix,
which is supposed to stay stable, i.e. there is no change in the routing.
The actions at the local level are denoted:

$$A(s(t)) = \left[ a_X(x(t)), a_Y(y(t)), a_Z(z(t)) \right], \quad \forall (x(t), y(t), z(t)) \in S^X \times S^Y \times S^Z, \quad \forall t \in \{0, 1, \dots, T\}.$$

Each component of this vector is associated with the action that should be taken on each site of each VPN, at
the instant $t$, in each possible state. For example, $a_X(x(t)) = [a_{X_{12}}(x_{12}(t)), a_{X_{21}}(x_{21}(t)), a_{X_{31}}(x_{31}(t))]$ contains the
actions to be taken on the links (1, 2), (2, 1) and (3, 1), respectively, of VPN1.
At the global level, the actions are centralized and denoted:

$$D(l(t)) = \left[ d_1(l_1(t)), d_2(l_2(t)), \dots, d_{|\mathcal{L}|}(l_{|\mathcal{L}|}(t)) \right], \quad \forall l_i(t) \in S^i, \quad \forall t \in \{0, 1, \dots, T\}.$$

Each element $d_i(l_i(t))$ stands for the action to be taken on the link $i$ of the MPLS network at time $t$, provided the link is
in the specific state $l_i(t)$.

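At the local level, we have to determine one optimal sequence of strategies per VPN. Formally, the problem can be
written in the form:

$$\begin{cases}
(\pi^X)^* = \arg\min_{B^X} \arg\max_{\pi^X} \left\{ E_{\pi^X}\left[ \sum_{t=0}^{T} C_t^X(X^{(t)}, F_t^X) \,\middle|\, X^{(0)} = s_1 \right] \right\}, & s_1 \in S^X, \\
(\pi^Y)^* = \arg\min_{B^Y} \arg\max_{\pi^Y} \left\{ E_{\pi^Y}\left[ \sum_{t=0}^{T} C_t^Y(Y^{(t)}, F_t^Y) \,\middle|\, Y^{(0)} = s_2 \right] \right\}, & s_2 \in S^Y, \\
(\pi^Z)^* = \arg\min_{B^Z} \arg\max_{\pi^Z} \left\{ E_{\pi^Z}\left[ \sum_{t=0}^{T} C_t^Z(Z^{(t)}, F_t^Z) \,\middle|\, Z^{(0)} = s_3 \right] \right\}, & s_3 \in S^Z.
\end{cases} \tag{22}$$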



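On the links, the structure of the decision problem is the same, except that the decisions are chosen on each link
separately:

$$(\pi^L)^* = \arg\min_{B^L} \arg\max_{\pi^L} \left\{ E_{\pi^L}\left[ \sum_{t=0}^{T} C_t^L(L^{(t)}, F_t^L) \,\middle|\, L^{(0)} = l \right] \right\}, \quad l \in S^1 \times S^2 \times \dots \times S^{|\mathcal{L}|}. \tag{23}$$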



Recall that the reward functions $C_t^X(X^{(t)})$, $C_t^Y(Y^{(t)})$ and $C_t^Z(Z^{(t)})$, defined for each of our 3-site VPNs, are each a sum over the pairs $\{i, j \in \{1, 2, 3\},\ i \neq j\}$ of a term depending on $X_{ij}^{(t)}$, $Y_{ij}^{(t)}$ and $Z_{ij}^{(t)}$, respectively. (24)

[Equation (24): the summands are not legible in the source.]

According to the same idea, the reward on the links is a sum over the links $\{i \in \{1, 2, \dots, |\mathcal{L}|\}\}$. (25)

[Equation (25): the summand is not legible in the source.]

We then apply Bellman's principle of optimality in order to get the sequences of optimal deterministic
strategies: $(\pi^X)^*$, $(\pi^Y)^*$, $(\pi^Z)^*$ and $(\pi^L)^*$.

Suppose now that we have computed these optimal strategies. The next problem is how to build
the optimal trajectories of the worst traffic process for each VPN.
We begin by choosing an initial state for each trajectory.

★ ($t = 0$).

- On VPN1, we denote by $x^{(0)} = (x_{12}^{(0)}, x_{21}^{(0)}, x_{31}^{(0)})$ the chosen initial state,
- on VPN2, we choose $y^{(0)} = (y_{12}^{(0)}, y_{21}^{(0)}, y_{31}^{(0)})$,
- and finally, on VPN3, we let $z^{(0)} = (z_{12}^{(0)}, z_{21}^{(0)}, z_{31}^{(0)})$.

★ ($t = 1$).

If $C(x^{(0)}) < Satis_X$, $C(y^{(0)}) < Satis_Y$ and $C(z^{(0)}) < Satis_Z$, then we choose the associated optimal actions and
get:

$$\begin{cases}
x^{(1)} = x^{(0)} + a_X(x^{(0)}), \\
y^{(1)} = y^{(0)} + a_Y(y^{(0)}), \\
z^{(1)} = z^{(0)} + a_Z(z^{(0)}).
\end{cases}$$

★ At the $t$-th iteration, we check whether or not $C(x^{(t)}) < Satis_X$, $C(y^{(t)}) < Satis_Y$ and $C(z^{(t)}) < Satis_Z$.
If it is the case, we proceed in exactly the same way and obtain:

$$\begin{cases}
x^{(t+1)} = x^{(t)} + a_X(x^{(t)}), \\
y^{(t+1)} = y^{(t)} + a_Y(y^{(t)}), \\
z^{(t+1)} = z^{(t)} + a_Z(z^{(t)}).
\end{cases}$$



However, if the levels are exceeded, then the decisions are centralized. We start by computing the associated
amount of traffic on each link of the MPLS network. In matrix form, we get:

$$l^{(t)} = R \begin{pmatrix} x^{(t)} \\ y^{(t)} \\ z^{(t)} \end{pmatrix}. \tag{26}$$



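As we know in which state the MPLS network globally lies, we choose the associated optimal action. This
action tells us the worst way the traffic behaves on each link of the MPLS network, and yields the vector of link traffic at the next instant:

$$l^{(t+1)} = \left( l_1^{(t+1)}, \dots, l_{|\mathcal{L}|}^{(t+1)} \right)^{\top}.$$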



★ At $(t + 1)$, we have to check whether or not the levels are satisfied. But we only know the global amounts of traffic
on each link of the MPLS network, whereas we need the amounts of traffic flowing through each oriented
pair of nodes on each VPN. The traffic being modeled as a global matrix for each VPN, we have to cope with the matrix
equation:

$$l^{(t+1)} = R \begin{pmatrix} x^{(t+1)} \\ y^{(t+1)} \\ z^{(t+1)} \end{pmatrix}. \tag{27}$$

Unfortunately, this problem is severely underdetermined in most applications.
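To see why recovering the per-VPN traffic from link measurements is underdetermined, consider the toy sketch below. The 2 x 3 routing matrix and the flow vectors are hypothetical and much smaller than the network of the paper; they only show that two different traffic vectors can produce exactly the same link loads.

```python
def link_loads(routing, flows):
    """Multiply a routing matrix (list of rows) by a vector of flows."""
    return [sum(r * f for r, f in zip(row, flows)) for row in routing]


# Hypothetical routing matrix: 2 observed links, 3 unknown flows.
R = [
    [1, 1, 0],  # link 1 carries flows 1 and 2
    [0, 1, 1],  # link 2 carries flows 2 and 3
]

flows_a = [2, 3, 1]
flows_b = [4, 1, 3]

loads_a = link_loads(R, flows_a)
loads_b = link_loads(R, flows_b)
# Both candidate flow vectors explain the same link measurements.
same_observation = loads_a == loads_b
```

Since the link loads alone cannot separate the two candidates, some additional modeling assumption is needed to pick one, which is the role played by the estimation technique of the next section.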

4.2 How to jump from a global level to local levels? 

Various statistical techniques of estimation can be employed to solve such problems. In this paper, we have chosen to use 
an original approach, at least in this field, based on the Cross-Entropy method ([16]). Indeed, this technique seems to be 
well-adapted to solve problems of changing routing, and consequently, it could be envisaged to be used in extensions of 
our approach. 

4.2.1 A brief introduction to the Cross-Entropy (CE) method 

The CE method ([16]) is a generic approach to combinatorial and multi-extremal optimization, as well as to rare event
simulation. It was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic
networks, which involves variance minimization. It was soon realized that a simple cross-entropy modification
could be used not only for estimating probabilities of rare events, but also for solving difficult combinatorial optimization
problems. This is done by translating the deterministic optimization problem into a related stochastic optimization
problem, and then using rare event simulation techniques.

The naive idea to estimate rare events is to simulate huge samples of data. Another, less tedious idea is to use
importance sampling (IS), whose aim is to simulate the system according to a density which increases the occurrence of
the rare event. Whereas the determination of the tilting parameters used in the IS technique is quite hard, the CE method
provides a way to cope efficiently with this difficulty.

Let $S : \mathcal{X} \to \mathbb{R}$ be a real-valued function. We introduce $X = (X_1, X_2, \dots, X_N)$, a random vector defined on the
space $\mathcal{X}$. Let $\{f(\cdot; v)\}_v$ be a family of parametric densities with respect to the Lebesgue measure.
We want to estimate:

$$\ell = P_u[S(X) \geq \gamma] = E_u\left[1_{\{S(X) \geq \gamma\}}\right].$$

If $\ell$ is very small (below a threshold of the form $10^{-k}$), we say that the event $\{S(X) \geq \gamma\}$ is a rare event. Using IS, we simulate a random sample $X_1, \dots, X_N$
according to an importance sampling density $g$ on $\mathcal{X}$. As a result, we get an estimator of the form:

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} 1_{\{S(X_i) \geq \gamma\}} \frac{f(X_i; u)}{g(X_i)}. \tag{28}$$

The optimal, zero-variance importance sampling density can easily be computed:

$$g^*(x) = \frac{1_{\{S(x) \geq \gamma\}} \, f(x; u)}{\ell}. \tag{29}$$
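The importance sampling estimator can be illustrated on a case where the exact answer is known. The sketch below is a toy example, not the setting of the paper: it assumes a nominal density Exp(1), a performance function S(x) = x, and an importance density Exp with rate 1/gamma (a hand-picked tilting). The function name `importance_sampling_tail` is hypothetical.

```python
import math
import random


def importance_sampling_tail(gamma, n, seed=0):
    """Estimate P(X >= gamma) for X ~ Exp(1) using an IS density Exp(rate=1/gamma)."""
    rng = random.Random(seed)
    theta = 1.0 / gamma  # assumed tilting: the IS density has mean gamma
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(theta)            # draw from g
        if x >= gamma:                        # indicator of the rare event, S(x) = x
            f = math.exp(-x)                  # nominal density f(x; u)
            g = theta * math.exp(-theta * x)  # importance density g(x)
            total += f / g                    # likelihood ratio weight
    return total / n


gamma = 20.0
estimate = importance_sampling_tail(gamma, n=100_000)
exact = math.exp(-gamma)  # closed form of the exponential tail
```

A crude Monte Carlo run of the same size would almost surely return zero here, since the event has probability exp(-20); sampling under the tilted density makes the event frequent and the weights correct for the change of measure.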



The idea is to choose $g$ in the family of parametric densities $\{f(\cdot; v)\}_v$, which is equivalent to determining the
optimal associated parameter. To determine this parameter, we look for the parametric density $f(\cdot; v)$ which is the nearest
to $g^*$ in the sense of the Kullback-Leibler distance. This pseudo-distance between two densities $g$ and $h$ is defined as follows:

$$\mathcal{D}(g, h) = E_g\left[\ln \frac{g(X)}{h(X)}\right] = \int g(x) \ln g(x) \, dx - \int g(x) \ln h(x) \, dx.$$

As a result, minimizing the distance between $g^*$ and $f(\cdot; v)$ is equivalent to solving:

$$\max_v \int g^*(x) \ln f(x; v) \, dx.$$

By substitution of (29) into this equation, we get:

$$\max_v D(v) = \max_v E_u\left[1_{\{S(X) \geq \gamma\}} \ln f(X; v)\right].$$

Finally, we can estimate $v$ using the associated stochastic problem:

$$v^* = \arg\max_v \frac{1}{N} \sum_{i=1}^{N} 1_{\{S(X_i) \geq \gamma\}} \ln f(X_i; v). \tag{30}$$

Now, we highlight the link between rare event estimation and classical optimization problems. Consider
an optimization problem of the form:

$$S(x^*) = \gamma^* = \max_{x \in \mathcal{X}} S(x). \tag{31}$$

Our goal is to change this optimization problem into an estimation problem. Let $\{1_{\{S(x) \geq \gamma\}},\ \gamma \in \mathbb{R}\}$ be a collection
of indicator functions, and $\{f(\cdot; v),\ v \in V\}$ be a parametric family of densities.

For a fixed parameter $u \in V$, we associate with (31) the following estimation problem:

$$\ell(\gamma) = P_u[S(X) \geq \gamma] = \sum_x 1_{\{S(x) \geq \gamma\}} f(x; u) = E_u\left[1_{\{S(X) \geq \gamma\}}\right]. \tag{32}$$

If $\gamma$ is close to $\gamma^*$, then $f(\cdot; v^*)$ puts most of its weight on $x^*$. Consequently, the estimator developed in
the context of rare event simulation can be applied. However, to get a good estimator of that kind, it is necessary
that $S(X_i) \geq \gamma$ for many realizations of the sample. This means that if $\gamma$ is close to $\gamma^*$, then $u$ must be chosen so that
$P_u[S(X) \geq \gamma]$ does not become too small. The idea is to simultaneously generate a sequence of levels $\gamma_1, \gamma_2, \dots, \gamma_T$ and a
sequence of parameters $v_1, v_2, \dots, v_T$, such that $\gamma_T$ tends towards the optimum $\gamma^*$, and that $v_T$ allows the density to give
a higher weight to the states improving the performance. This two-level algorithm takes the simple form:

Algorithm 3.

1- Choose $v_0 = u$, and let $t = 1$.

2- Generate $X_1, \dots, X_N \sim f(\cdot; v_{t-1})$, then compute the estimate $\gamma_t$ of the $(1 - \rho)$-quantile of the performance function:

$$\gamma_t = S_{(\lceil (1 - \rho) N \rceil)},$$

where $S_{(n)}$ is the $n$-th element of the order statistics.

3- Using $X_1, \dots, X_N$, solve the following stochastic problem:

$$v_t = \arg\max_v \frac{1}{N} \sum_{i=1}^{N} 1_{\{S(X_i) \geq \gamma_t\}} \ln f(X_i; v). \tag{33}$$

4- If $\gamma_t = \gamma_{t-1} = \dots = \gamma_{t-d}$, STOP;
else, set $t = t + 1$ and go back to step 2.

$d$ is a constant ($d = 5$ is generally a good compromise), and $\rho$ characterizes the level of rarity chosen.
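The four steps of the algorithm can be sketched on a one-dimensional toy problem. This is a minimal illustration, not the implementation used in the paper: it assumes a Gaussian parametric family (for which the maximizer of the stochastic problem in step 3 is the mean and standard deviation of the elite samples), a hypothetical score function, and a numerical tolerance `tol` in place of the exact equality test of step 4, since exact equality is rarely met with continuous samples.

```python
import math
import random


def cross_entropy_maximize(score, mu=0.0, sigma=5.0, n=200, rho=0.1,
                           d=5, tol=1e-6, max_iter=200, seed=0):
    """CE maximization of `score` over the reals with a Gaussian family."""
    rng = random.Random(seed)
    levels = []
    gamma = -math.inf
    for _ in range(max_iter):
        # Step 2: sample from the current density and rank by performance.
        xs = sorted((rng.gauss(mu, sigma) for _ in range(n)), key=score)
        k = math.ceil((1 - rho) * n)
        gamma = score(xs[k - 1])  # (1 - rho)-quantile estimate
        # Step 3: Gaussian family, so the update is the sample mean and
        # standard deviation of the elite samples (score >= gamma).
        elite = xs[k - 1:]
        mu = sum(elite) / len(elite)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in elite) / len(elite))
        levels.append(gamma)
        # Step 4: stop once the last d + 1 levels agree up to `tol`.
        recent = levels[-(d + 1):]
        if len(recent) == d + 1 and max(recent) - min(recent) < tol:
            break
    return mu, gamma


# Hypothetical score with a single maximum at x = 2.
best_x, best_level = cross_entropy_maximize(lambda x: -(x - 2.0) ** 2)
```

The sampling density concentrates around the maximizer as the quantile level rises, which is the mechanism the text describes: the level sequence climbs towards the optimum while the parameter sequence shifts the weight towards well-performing states.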
4.2.2 Application of the CE method to estimate the amount of traffic on each VPN

We define a parametric utility function for each VPN. This utility function represents the subjective interpretation, by
the network manager, of the impact of its bandwidth reservation on the quality of service of each VPN composing
the whole MPLS network. We suppose that the manager's interpretation follows a gamma density, whose parameter
$p_i \in ]1; \infty[$, $i = 1, 2, 3$, is unknown. Indeed, the traffic sent on VPN 1 might not be of the same
type as the one on VPN 2. Hence, the operator would not use the same utility function to characterize the impact of its
allocation on the traffic of VPN 1 or 2.



[Figure 10 plots the utility as a function of the bandwidth for elastic applications and for "delay"-adaptive applications; below the threshold MR there is no sufficient bandwidth and the connection is not initiated.]

Figure 10: The utility functions.



The gamma density is particularly well adapted to model the two types of traffic we have to deal with. The first one,
associated with densities whose parameter $p \in ]1; 2]$, models elastic traffic. This type of traffic has no real requirements
in terms of delay and transfer rates. Classical examples are email and data file transfer. The second one, for $p > 2$,
represents applications sensitive to delays, which require an instantaneous transmission, like voice or video over IP. If
there is not sufficient bandwidth, the connection is not initiated, because compression software has
an upper bound on the possible compression. This explains why the utility function equals zero below a certain value, MR
(see [17]). Furthermore, the fact that the densities decrease asymptotically can be interpreted from an economic point
of view. Indeed, if we reserve very large amounts of bandwidth for a particular traffic, the link capacity will be saturated,
and the operator will make the client pay a lot, since there is no more resource available for other types of traffic.

We introduce a vector notation to store the amount of reserved bandwidth on each VPN link. For VPN1, at the
decision epoch $(t + 1)$, we denote by

$$B_X^{(t+1)} = \left( B_{X_{ij}}^{(t+1)} \right)_{i \neq j} \tag{34}$$

the column vector whose entries are the bandwidths reserved on the links $(i, j)$ of VPN1.
We proceed in the same way to define $B_Y^{(t+1)}$ and $B_Z^{(t+1)}$, which are respectively the volumes of reserved bandwidth on
VPNs 2 and 3. Besides, we suppose that these variables are generated following a gamma density:

$$\begin{cases}
B_X^{(t+1)} \sim \gamma(p_1), & p_1 \in ]1; \infty[, \\
B_Y^{(t+1)} \sim \gamma(p_2), & p_2 \in ]1; \infty[, \\
B_Z^{(t+1)} \sim \gamma(p_3), & p_3 \in ]1; \infty[.
\end{cases} \tag{35}$$

Our aim is to estimate the value of the unknown parameters $p_1$, $p_2$ and $p_3$, and to get a random sample
solution of the equation (36). Recall that the gamma density $\gamma(p)$, $p > 0$, is of the form:

$$f(x; p) = \frac{x^{p-1} e^{-x}}{\Gamma(p)}, \quad x > 0, \qquad \text{where } \Gamma(p) = \int_0^{\infty} e^{-x} x^{p-1} \, dx.$$



In terms of bandwidth reservation, we get the following matrix formulation:

$$\left( B^{link} \right)^{(t+1)} = R \begin{pmatrix} B_X^{(t+1)} \\ B_Y^{(t+1)} \\ B_Z^{(t+1)} \end{pmatrix}. \tag{36}$$



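On each link $(i, j)$ of VPN1, knowing the volume of traffic $X_{ij}^{(t)}$ flowing through this link at the discrete time $t$,
we can compute the minimum reserved bandwidth needed by solving a continuous optimization problem. The solution can
be written using a bijective representation, which expresses the reserved bandwidth $B_{X_{ij}}^{(t)}$ as a function of $X_{ij}^{(t)}$ and of the density parameter. (37)

[Equation (37): the closed-form expression is not legible in the source.]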



Hypothesis. We suppose that the sum of the parameters is constant: $p_1 + p_2 + p_3 = K$, where $K$ is a positive constant.

This hypothesis encodes the prior knowledge of the network manager on the nature of the traffic he carries. For example, he may
know that the traffic on VPN1 and VPN2 is elastic, which implies that $p_1 + p_2 < 4$, and fix an upper level on the density
parameter for VPN3, $2 < p_3 < 4$. Consequently, he gets the upper bound $K = 8$.

Furthermore, since the random vectors $B_X^{(t+1)}$, $B_Y^{(t+1)}$ and $B_Z^{(t+1)}$ are independent, the vector density is of the form
$\gamma(p) = \gamma(p_1) \, \gamma(p_2) \, \gamma(p_3)$. In the application of the CE algorithm, we suppose that the decision epoch is fixed to
$(t + 1)$, but this algorithm can be applied any time we have to jump from the global level to the local levels.

We now go through the steps of the CE algorithm.

1- We begin with the initialization of the density parameters:

$$\left( \hat{p}_1^{(0)}, \hat{p}_2^{(0)}, \hat{p}_3^{(0)} \right)^{\top}.$$

2- At the simulation instant $\tau \geq 1$, we simulate a sample of $N$ vectors $B^1, B^2, \dots, B^N$, where

$$B^i = \left( (B_X)^i, (B_Y)^i, (B_Z)^i \right)^{\top}, \quad i = 1, 2, \dots, N.$$

Hence, we infer the volumes of reserved bandwidth on each link:

$$(B^{link})^i = R \begin{pmatrix} (B_X)^i \\ (B_Y)^i \\ (B_Z)^i \end{pmatrix}, \quad i = 1, 2, \dots, N. \tag{38}$$

Then we compute the performance function,

$$S(B^i) = \frac{1}{\left\| (B^{link})^{(t+1)} - (B^{link})^i \right\|}, \quad i = 1, 2, \dots, N.$$

After ordering the statistics, we obtain the $(1 - \rho)$-quantile of the performance function using the estimator:

$$\gamma_\tau = S_{(\lceil (1 - \rho) N \rceil)}.$$

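3- Finally, to update the parameters, we have to solve a system of two equations:

$$\begin{cases}
\dfrac{\partial}{\partial p_1} \ln\left( \Gamma(p_1) \, \Gamma(K - p_1 - p_2) \right) = \dfrac{\sum_{i=1}^{N} 1_{\{S(B^i) \geq \gamma_\tau\}} \left( \ln((B_X)^i_j) - \ln((B_Z)^i_j) \right)}{\sum_{i=1}^{N} 1_{\{S(B^i) \geq \gamma_\tau\}}}, & \forall j = 1, 2, 3, \\[2ex]
\dfrac{\partial}{\partial p_2} \ln\left( \Gamma(p_2) \, \Gamma(K - p_1 - p_2) \right) = \dfrac{\sum_{i=1}^{N} 1_{\{S(B^i) \geq \gamma_\tau\}} \left( \ln((B_Y)^i_j) - \ln((B_Z)^i_j) \right)}{\sum_{i=1}^{N} 1_{\{S(B^i) \geq \gamma_\tau\}}}, & \forall j = 1, 2, 3.
\end{cases} \tag{39}$$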



If we restrict ourselves to integer values of the parameters $p_1$ and $p_2$, we just have to build a fine grid on the space
defined by $\{(p_1, p_2, p_3) \mid p_1 + p_2 + p_3 = K,\ p_i > 0,\ i = 1, 2, 3\}$. We then get estimated values of the
parameters. Finally, we update the parameters to their new values:

$$\hat{p}_1^{(\tau)} = p_1, \quad \hat{p}_2^{(\tau)} = p_2, \quad \hat{p}_3^{(\tau)} = p_3.$$

4- We stop as soon as $\gamma_\tau = \gamma_{\tau - 1} = \dots = \gamma_{\tau - 5}$.
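The parameter update on an integer grid can be sketched as follows. This is an illustration under simplifying assumptions: each elite sample is reduced to one positive scalar per VPN, the density is the one-parameter gamma density recalled earlier, and the update maximizes the elite log-likelihood directly over the grid instead of solving the stationarity equations. The names `gamma_log_density` and `update_on_grid`, and the synthetic data, are hypothetical.

```python
import math
import random


def gamma_log_density(x, p):
    """Log of the one-parameter gamma density x^(p-1) e^(-x) / Gamma(p)."""
    return (p - 1) * math.log(x) - x - math.lgamma(p)


def update_on_grid(elite, K):
    """Integer (p1, p2, p3) with p1 + p2 + p3 = K maximizing the elite log-likelihood.

    `elite` is a list of triples (b_x, b_y, b_z) of positive reals, standing
    for the samples that passed the quantile threshold.
    """
    best, best_value = None, -math.inf
    for p1 in range(1, K - 1):
        for p2 in range(1, K - p1):
            p3 = K - p1 - p2  # the constraint fixes the third parameter
            value = sum(
                gamma_log_density(bx, p1)
                + gamma_log_density(by, p2)
                + gamma_log_density(bz, p3)
                for bx, by, bz in elite
            )
            if value > best_value:
                best, best_value = (p1, p2, p3), value
    return best


# Synthetic check: draw triples with known parameters and recover them.
rng = random.Random(1)
true_p = (3, 4, 5)
samples = [tuple(rng.gammavariate(p, 1.0) for p in true_p) for _ in range(2000)]
estimated = update_on_grid(samples, K=sum(true_p))
```

Because the constraint removes one degree of freedom, the grid is two-dimensional and stays small for moderate values of K, which is what makes the exhaustive search practical.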



Remark. Once we have determined the optimal reserved bandwidth on each link, it is quite simple to get the value
of the traffic on the link, using the bijectivity of the function in (37).

[Figure 11 panels: bandwidth reservation on the six links of VPN1; bandwidth reservation on the six links of VPN3.]

Figure 11: Simulation of the reserved bandwidth on each of the 6 links of the 3 VPNs, using the CE method.
The constant was set to $K = 70$; the estimated values of the gamma density parameters are $p_1 = 3$, $p_2 = 4$ and
$p_3 = 23$.



4.3 Existence of stationary strategies for hierarchical MDPs 

The method we have developed so far enables us to control optimally the dynamic evolution of our system, under the
assumption that the horizon is finite. The optimality results from the introduction of centralized decisions, which aim to
correct the evolution of the system in order to satisfy the levels chosen by each VPN client.

In this section, our purpose is to study the asymptotic behavior of our system. The idea is, as in the very simple
case of a 3-site VPN, to prove the existence of a stationary strategy, and the convergence of the
strategies obtained via dynamic programming towards this stationary strategy. Indeed, we have already proved
the existence of a stationary strategy for a mono-path, stable routing VPN, and the convergence of its strategies towards it.
Besides, the stationary strategies are deterministic, because we apply the simplex algorithm to compute them. Using the
hierarchical MDP principle, Bellman's optimality equation gives us three distinct sequences of strategies, one for each VPN,
on the time interval $[0; T]$:

$$\begin{cases}
(F_0^X, F_1^X, \dots, F_T^X) \to F^X, & \text{for VPN 1}, \\
(F_0^Y, F_1^Y, \dots, F_T^Y) \to F^Y, & \text{for VPN 2}, \\
(F_0^Z, F_1^Z, \dots, F_T^Z) \to F^Z, & \text{for VPN 3},
\end{cases} \tag{40}$$

where $F^X$, $F^Y$ and $F^Z$ are the associated asymptotic stationary strategies. If we manage to prove that $(F_0^l, F_1^l, \dots, F_T^l)$
converges towards a stationary strategy $F^l$ for each link $l \in \mathcal{L}$, then the system is asymptotically driven into a stable
behavior, insofar as the local and the global strategies are both stationary.

[Figure 12 plots the convergence of the $(1 - \rho)$-quantile against the iteration number.]

Figure 12: The quantile function converges in around 15 iterations.

Our aim now is to prove that the stochastic process $\{L^{(t)}\}_t$, modeling the dynamic evolution of the demand on
the links of the MPLS network, is ergodic. Consequently, for each link $l \in \mathcal{L}$, we need to solve a linear program of the
form:



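$$\begin{cases}
\max \displaystyle\sum_{s=1}^{|S^l|} \sum_{a=1}^{3} C^l(s, a) \, x_{sa} \\[2ex]
\displaystyle\sum_{s=1}^{|S^l|} \sum_{a=1}^{3} \left[ \delta(s, s') - \beta \, p(s' \mid s, a) \right] x_{sa} = \gamma(s'), \quad \forall s' \in S^l, \\[2ex]
x_{sa} \geq 0, \quad a \in A, \quad s \in S^l.
\end{cases} \tag{41}$$

For each state $s \in S^l$, $l \in \mathcal{L}$, and each action $a \in A$, the optimal strategy on the link $l$ is easily obtained from:

$$f^l(s, a) = \frac{x_{sa}}{\sum_{a'=1}^{3} x_{sa'}}. \tag{42}$$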



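All we need to do is to verify the equality between temporal and spatial means, for all $k \in \mathbb{N}$: (43)

[Equation (43), which equates a limit as $T \to \infty$ of a time average with the corresponding spatial mean, is not legible in the source.]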



Recall that the probability distributions $f^l(s, a)$ are obtained through linear programming, while we can infer the
values of the parameters $x_{sa}$ through simulation only.

We obtain the coincidence of these two means, which proves the ergodicity of the stochastic process $\{L^{(t)}\}_t$. We can
conclude from these results that, asymptotically, the local and the centralized strategies obtained via Bellman's optimality
equation converge to stationary controls: $F^X$, $F^Y$, $F^Z$ and $F^l$, respectively.
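The comparison between temporal and spatial means can be illustrated on a small Markov chain. The sketch below is a toy check, not the experiment of the paper: the 3-state transition matrix is hypothetical, the spatial mean is represented by the stationary distribution (computed by power iteration), and the temporal mean by the empirical state frequencies along one simulated trajectory.

```python
import random


def stationary_distribution(P, iterations=2000):
    """Stationary distribution of a stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
    return pi


def time_average_frequencies(P, steps, start=0, seed=0):
    """Fraction of time spent in each state along one simulated trajectory."""
    rng = random.Random(seed)
    counts = [0] * len(P)
    states = range(len(P))
    state = start
    for _ in range(steps):
        counts[state] += 1
        state = rng.choices(states, weights=P[state])[0]
    return [c / steps for c in counts]


# Hypothetical irreducible, aperiodic chain on 3 states.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
spatial = stationary_distribution(P)
temporal = time_average_frequencies(P, steps=200_000)
gap = max(abs(a - b) for a, b in zip(spatial, temporal))
```

For an ergodic chain the gap shrinks as the trajectory gets longer; a gap that does not shrink would signal that the chain is not ergodic or that the run is too short.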

We can infer from these results that for $T$ not too large (i.e. $T < 200$), the stochastic dynamic approach is well adapted,
but if we let $T$ increase towards infinity, it becomes rather tedious to compute all the optimal strategies. Since
we have proved that, asymptotically, our system adopts a stationary deterministic control, the use of linear programming
provides an elegant and simple solution.

Figure 13: Hierarchical MDPs: histograms of the dynamic evolution of the MDPs $X_i^{(t)}$, $Y_i^{(t)}$ and $Z_i^{(t)}$. The
state space associated with each MDP $X_i^{(t)}$, $Y_i^{(t)}$ and $Z_i^{(t)}$, $i = 1, 2, 3$, is of cardinality 3: $S^{X_i} = S^{Y_i} =
S^{Z_i} = \{(0; 9), (4; 5), (8; 1)\}$, $i = 1, 2, 3$. But the global state space, which is required to take decisions
at the global level, is of cardinality $27^3$. Consequently, it quickly becomes very hard to cope with such high
dimensional spaces.



4.4 The switching control game approach 

In this section, we delve into the fascinating world of stochastic games. Our aim is still to determine stationary strategies.
However, while we considered a hierarchical approach in the previous section, we try here to model the problem as a matrix
game, where decisions are taken either by the operator or by the VPN owners, depending on whether or not
the satisfaction bounds are exceeded. Besides, the model is based on an initial assumption, which states that the game
converges towards an equilibrium where the global delay on the links and the sum of all the delays on each VPN
coincide, up to an additive constant. Furthermore, switching-control games belong to the rare classes of games
which can be solved with the help of linear programming. Consequently, this model seems particularly promising.
We still consider an MPLS network composed of 3 independent VPNs. The routing is once more mono-path and stable.
Using the vocabulary of game theory, we observe that our virtual network is made of various actors, whose interests are
quite opposite. The only aim of a client owning a VPN is to get the best possible QoS for his own traffic. The
VPN owners behave completely non-cooperatively, since the traffic of each VPN evolves without any collusion between
the clients, whose single-minded purpose is to minimize the delay on their own VPN. Furthermore, the clients are not
aware of the presence of one another in the game, and behave perfectly selfishly.

As in the hierarchical case, we still assume that each VPN owner has previously determined a satisfaction level for its
delay. If this bound is exceeded, the owners should have the opportunity to call for a centralized
management. The operator carries out this centralized management by choosing controls on the links of the network. The
global traffic is still supposed to follow the worst possible evolution, but the operator's purpose is now to minimize the
global delay.

Figure 14: Hierarchical MDPs: dynamic evolution of the traffic on each site, in each VPN. As an application,
we consider a four-state state space. The hierarchical MDP approach enables us to control our system optimally
and, furthermore, to characterize the evolution of each VPN relative to one another.

Once more, we refer to the global process $(X^{(t)}, Y^{(t)}, Z^{(t)})$, which takes its values in the global state space $S^X \times S^Y \times S^Z$.

If we think about the way our decisions are made, we realize that the state space can be partitioned into two disjoint
subsets. Indeed, there are some combinations of states $(x^{(t)}, y^{(t)}, z^{(t)})$ that automatically violate the satisfaction
bounds imposed by at least one of the owners. As a result, for all these global states, the network should be centrally
controlled. This subset of the state space is denoted $E^2$.

On the contrary, on the rest of the global states, called $E^1$, the satisfaction bounds are not exceeded, and the decisions
are taken independently on each VPN.
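The partition of the global state space can be sketched by enumeration. The snippet below is only an illustration: the per-link traffic levels, the cost function (a plain sum standing in for the delay criterion) and the satisfaction bounds are hypothetical. A global state is placed in the centrally controlled subset as soon as one VPN reaches or exceeds its bound.

```python
from itertools import product


def partition_states(vpn_states, cost, bounds):
    """Split the global states into (local, central) according to the bounds."""
    local, central = [], []
    for state in product(vpn_states, repeat=len(bounds)):
        # A bound is satisfied when the cost is strictly below it.
        violated = any(cost(s) >= b for s, b in zip(state, bounds))
        (central if violated else local).append(state)
    return local, central


# Hypothetical setting: 3 traffic levels per link, 3 links per VPN, 3 VPNs.
levels = (0, 4, 8)
vpn_states = list(product(levels, repeat=3))  # 27 states per VPN
E1, E2 = partition_states(vpn_states, cost=sum, bounds=(20, 20, 20))
```

Once the two subsets are tabulated, deciding which level acts in a given state is a simple lookup, with no need to re-evaluate the bounds along the trajectory.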



The idea is to model the problem as a two-person zero-sum game. The first player represents the set of the 3
VPNs, evolving independently and selfishly. The action space $A$ contains all the possible combinations of the choices
of each site in each VPN. Each possible combination is formally represented as a 9-dimensional vector. We
assume that the choice of action on each VPN link is reduced to the 3 alternatives described in section 2: $\{a_0; a_1; a_2\}$.
Consequently, $A$ is of finite cardinality, since the choices of actions on the 3 VPNs are independent and the action
space of each VPN is finite.

The second player stands for the network manager, who should centrally manage the whole MPLS network by taking
actions on the links of the network. This time, the action space is denoted $D$, and it contains all the possible combinations
of actions that could be chosen on each link. We suppose that each decision on each link is chosen in the 3-element space
$\{d_0; d_1; d_2\}$. The two players are allowed to have the same basic actions: $d_i = a_i$, $\forall i$.

Figure 16: The classification of the state space of the 3 VPNs.

We make the assumption which characterizes a switching-control game: on the states belonging to $E^1$, only player 1 can
influence the transitions, whereas on the states belonging to $E^2$, it is player 2 who controls the transitions.
However, the reward function $r(s, a, d)$ depends on the actions of both players. It combines sums over the pairs $\{i, j \in \{1, 2, 3\},\ i \neq j\}$ of each VPN with a sum over the links of the MPLS network and an additive constant. The legible part of (44) is the link term

$$\sum_{l \in \mathcal{L}} C^l\left( \left( R \, (s_X \mid s_Y \mid s_Z)^{\top} \right)_l, d^l \right) + \lambda, \quad \forall s \in E^1 \cup E^2, \quad \forall a \in A, \quad \forall d \in D, \quad \lambda \in \mathbb{R}_+; \tag{44}$$

the VPN sums that precede it are not legible in the source.

$\lambda \in \mathbb{R}_+$ models the amount of bandwidth that the operator always keeps free for fear of congestion.
$s = (s_{X_{12}} \, s_{X_{21}} \, s_{X_{31}} \mid s_{Y_{12}} \, s_{Y_{21}} \, s_{Y_{31}} \mid s_{Z_{12}} \, s_{Z_{21}} \, s_{Z_{31}})^{\top} \in E^1 \cup E^2$ represents a realization of the global process taking its
values in the state space. For ease of notation, we rather write $s = (s_X \mid s_Y \mid s_Z)^{\top}$.
Furthermore, to each state belonging to $E^2$, we can associate a specific configuration of the traffic values on the MPLS links.
$R$ being the routing matrix, we obtain the values of the traffic on the links by computing the matrix product:

$$\left( l_1, \dots, l_{|\mathcal{L}|} \right)^{\top} := R \begin{pmatrix} s_X \\ s_Y \\ s_Z \end{pmatrix}.$$

The set of the various possible link traffic configurations is once more denoted $\mathcal{L}$.



The resulting game is a zero-sum game, since player 1 clearly wants to maximize the reward function, whereas player
2 wants to minimize it.

We know, as is the case for every matrix game, that the game has a value, and that both players possess optimal actions.
This famous result is due to J. von Neumann, and can be found in the rich literature related to the subject. The value of
the game at time $t$ is denoted $V_t$. It is a vector of length $|E^1| + |E^2|$.

Our purpose now is to determine the optimal stationary strategies associated with each player. To this end, we use
the algorithm developed in [5], which is proved to converge in a finite number of iterations.

In fact, switching-control games belong to the class of stochastic games satisfying the ordered field property, which
characterizes the games whose solution can be found in the same algebraic field as the data of the game. This
class of games is all the more important as only for such games can one expect to develop a finite algorithm
for deriving a solution.

Now, we describe in detail the algorithm we have used to compute the stationary strategies.

★ We start by choosing an initial deterministic strategy for player 1, denoted $F_0(E^1)$. This formal
presentation only means that we choose a pure action in each state of the state space $E^1$.

★ Then, $F_t(E^1)$ being fixed, we solve the discounted game with a single controller, $\beta \in [0; 1[$, as a linear program in the variables $x_{sd} \geq 0$, $\forall d \in D$, $\forall s \in E^2$, whose objective sums over the states $s \in \{1, \dots, |E^2|\}$, the actions $d \in D$ and the links $l \in \mathcal{L}$. (45)

[Equation (45): the objective and the constraints are not fully legible in the source.]

In the value vector $V_t$, we store the value of the game for each component belonging to $E^2$.

★ For each state $s \in E^1$, we determine the action $F_{t+1}(s)$ as an extreme optimal action for player 1 in the matrix
game with entries:

$$\left[ (1 - \beta) \, r(s, a, d) + \beta \sum_{s'=1}^{|E^1| + |E^2|} p(s' \mid s, a) \, V_t(s') \right]_{a \in A,\ d \in D}. \tag{46}$$

★ If $V_t = V_{t-1}$, then $V_t$ is the value of the game, and $F_t(E^1)$ is the projection of an optimal stationary strategy for
the game on the space $E^1$.

Figure 17: The algorithm gives us the indices of the optimal action associated with each state. The cardinality of
the action space associated with the 3 VPNs was $27^3$, and the dimension of the action space associated with the
links was reduced here to 12^ [exponent illegible]. On the first picture, we can see the indices of the optimal actions on the space
of the global decisions; on the second one, we have represented the indices of the local strategies on the
space $E^1$.

Remark. Once we have determined the stationary strategies, we use the same principle as in the hierarchical approach
to construct the traffic trajectories. We begin by choosing an initial state. Then, if this state belongs to the subspace $E^1$, we
take the associated optimal actions on each VPN, but if the state belongs to $E^2$, we choose the optimal global actions on
the links. One of the advantages of this approach is that we do not need to verify whether or not the satisfaction levels are
exceeded, since the location of the state in the state space gives us enough information.



5 The curse of dimensionality and optimization in policy space 

The use of Markov decision processes and the associated dynamic programming methodology quickly becomes limited,
due to the high cardinality of the state space. A solution to this problem lies in the introduction of parametric
representations. There are three main methods to tackle the problem of dimensionality. The first well-known method, called
neuro-dynamic programming or reinforcement learning, requires the introduction of weights in the value function. In
each state $s \in S$, the value function takes the form $V(s, r)$, $r > 0$. The idea is to tune the weights so as to obtain
a good approximation of the value function, and to infer a policy as close as possible to the optimal one. The second
method, essentially developed in [18], considers a class of policies described by a parameter vector $\theta \in \mathbb{R}^K$. The policy
is improved by updating $\theta$ in a gradient direction, via simulation. The third and last one, called actor-critic, combines the
principles of both approaches.

In this article, we concentrate on improving the parametrized policy by searching over the policy space. Our
performance metric will be the average reward function, since the methodology developed in [18] requires such an
assumption in order to introduce the steady state probabilities in the performance function and, later, to derive a proper
estimate for the gradient function. The long term average reward is commonly denoted:



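$$\lambda(\theta) = \lim_{T \to \infty} \frac{1}{T}\, E\left[\sum_{t=0}^{T} C_t\big(x^{(t)}, \theta\big)\right] \qquad (47)$$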



where, as before, $x^{(t)}$ denotes the realization of the stochastic process $\{X^{(t)}\}_t$ at the instant $t$.



At the instant $t$, we define a parametric matrix of strategies $F_t(\theta)$. Let $\theta \in \mathbb{R}^K$ be a parameter of size $K > 0$. We
define $f_t(s, a, \theta)$ as the probability of choosing the action $a \in A$ when the system is in the state $s \in S$, at the decision epoch $t$.
The parametrized transition probabilities and reward function take the form:



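$$p_\theta(s, s') = \sum_{a \in A} f_t(s, a, \theta)\, p(s' \mid s, a), \quad \forall s, s' \in S,$$
$$C_t(s, \theta) = \sum_{a \in A} f_t(s, a, \theta)\, C_t(s, a), \quad \forall s \in S. \qquad (48)$$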
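As an illustration of (48), the following minimal sketch averages a controlled model over a randomized policy and, assuming the resulting chain is ergodic, evaluates its average reward through the stationary distribution. The function names, the array layout and the numerical values are hypothetical and serve only as an example.

```python
import numpy as np

def policy_averaged_model(f, p, c):
    """Build the chain induced by a randomized policy, as in (48).

    f[s, a]     : probability of choosing action a in state s (rows sum to 1)
    p[a, s, s2] : transition probability p(s2 | s, a)
    c[s, a]     : one-step reward C_t(s, a)
    """
    P_theta = np.einsum("sa,ast->st", f, p)   # sum_a f(s, a) p(s' | s, a)
    C_theta = np.sum(f * c, axis=1)           # sum_a f(s, a) C(s, a)
    return P_theta, C_theta

def average_reward(P_theta, C_theta):
    """Average reward sum_s pi(s) C(s), assuming a unique stationary law pi."""
    n = P_theta.shape[0]
    # Solve pi P = pi together with sum(pi) = 1 as a least-squares system.
    A = np.vstack([P_theta.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ C_theta)

# Toy data (hypothetical): 2 states, 2 actions, uniform randomized policy.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
c = np.array([[1.0, 0.0], [0.0, 2.0]])
f = np.array([[0.5, 0.5], [0.5, 0.5]])

P_theta, C_theta = policy_averaged_model(f, p, c)
lam = average_reward(P_theta, C_theta)
```

Such a direct evaluation is only tractable for small state spaces, which is precisely why the rest of the section resorts to simulation.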



We denote by $\mathcal{P} = \{P(\theta) = (p_\theta(s, s'))_{s, s' \in S},\ \theta \in \mathbb{R}^K\}$ the set of the transition probability matrices, and by $\bar{\mathcal{P}}$ its closure,
which is also composed of stochastic matrices. Furthermore, we make the following assumption, required to prove the
convergence of the associated algorithm:

• The Markov chain corresponding to every $P \in \bar{\mathcal{P}}$ is aperiodic, which means that the GCD of the lengths of all its cycles
is one. Besides, there exists a state $s^* \in S$ which is recurrent for every such Markov chain.
Our purpose is now to maximize the average reward:



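$$\theta^* = \arg\max_{\theta} \{\lambda(\theta)\} = \arg\max_{\theta} \left\{ \lim_{T \to \infty} \frac{1}{T}\, E\left[\sum_{t=0}^{T} C_t\big(x^{(t)}, \theta\big)\right] \right\} \qquad (49)$$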



where $x^{(t)}$ represents a realization of the stochastic process $\{X^{(t)}\}_t$ at the instant $t$. The expectation is computed
with respect to the randomized strategy $F_t(\theta)$. The first idea is to introduce the well-known gradient algorithm to get an
estimate of the parameter:

$$\theta(t+1) = \theta(t) + \gamma_t\, \nabla_\theta \lambda(\theta(t)), \quad \gamma_t = \frac{1}{t}, \quad t \in \mathbb{N}.$$

Unfortunately, we cannot compute the gradient of the performance function analytically, and must resort to simulation. The
algorithm developed in [18] updates the value of the parameter at every time step, and uses a biased estimate (whose bias
asymptotically vanishes) of the gradient of the performance metric:



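$$\theta(t+1) = \theta(t) + \gamma_t \Big( \nabla_\theta C_t\big(x^{(t)}, \theta(t)\big) + \big( C_t\big(x^{(t)}, \theta(t)\big) - \lambda_t \big)\, z_t \Big),$$
$$\lambda_{t+1} = \lambda_t + \eta\, \gamma_t \Big( C_t\big(x^{(t)}, \theta(t)\big) - \lambda_t \Big). \qquad (50)$$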

$\eta > 0$ is a parameter which enables us to scale the stepsize of our algorithm for updating $\lambda_t$ by a positive constant.
Then, we simulate a transition to the next state $x^{(t+1)}$ following the transition probabilities $\{p_{\theta(t+1)}(x^{(t)}, s),\ s \in S\}$.

At the same time, $z$ is updated according to the following rules:



$$z_{t+1} = \begin{cases} 0, & \text{if } x^{(t+1)} = s^*, \\ z_t + L_{\theta(t)}\big(x^{(t)}, x^{(t+1)}\big), & \text{otherwise,} \end{cases} \qquad (51)$$



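where $L_{\theta(t)}\big(x^{(t)}, x^{(t+1)}\big) = \dfrac{\nabla_\theta\, p_{\theta(t)}(x^{(t)}, x^{(t+1)})}{p_{\theta(t)}(x^{(t)}, x^{(t+1)})}$ if $p_{\theta(t)}(x^{(t)}, x^{(t+1)}) > 0$, and $0$ otherwise. This term can be interpreted as a
likelihood ratio derivative term.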
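The updates (50) and (51) can be sketched on a toy problem as follows. This is a minimal illustration and not the implementation used for the experiments: the 2-state, 2-action model, the sigmoid parametrization of the policy, the choice of state 0 as the recurrent state $s^*$, and every function name are assumptions introduced for the example.

```python
import numpy as np

# Hypothetical toy model: p[a, s, s2] = p(s2 | s, a), c[s, a] = one-step reward.
p = np.array([[[0.8, 0.2], [0.6, 0.4]],
              [[0.3, 0.7], [0.1, 0.9]]])
c = np.array([[0.0, 1.0],
              [0.5, 2.0]])
S_STAR = 0  # state assumed recurrent under every parameter value

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def policy(theta):
    """f[s, a]: action 1 is chosen in state s with probability sigmoid(theta[s])."""
    q = sigmoid(theta)
    return np.stack([1.0 - q, q], axis=1)

def reward_and_grad(s, theta):
    """C(s, theta) as in (48), and its gradient with respect to theta."""
    f = policy(theta)
    q = f[s, 1]
    grad = np.zeros_like(theta)
    grad[s] = q * (1.0 - q) * (c[s, 1] - c[s, 0])
    return float(f[s] @ c[s]), grad

def trans_row(s, theta):
    """Row p_theta(s, .) as in (48)."""
    f = policy(theta)
    return f[s, 0] * p[0, s] + f[s, 1] * p[1, s]

def likelihood_ratio(s, s2, theta):
    """grad p_theta(s, s2) / p_theta(s, s2), or 0 when p_theta(s, s2) = 0."""
    prob = trans_row(s, theta)[s2]
    out = np.zeros_like(theta)
    if prob > 0.0:
        q = sigmoid(theta[s])
        out[s] = q * (1.0 - q) * (p[1, s, s2] - p[0, s, s2]) / prob
    return out

def run(n_steps=5000, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, z, lam, x = np.zeros(2), np.zeros(2), 0.0, S_STAR
    for t in range(1, n_steps + 1):
        gamma = 1.0 / t
        C, grad_C = reward_and_grad(x, theta)
        theta_new = theta + gamma * (grad_C + (C - lam) * z)   # first line of (50)
        lam = lam + eta * gamma * (C - lam)                     # second line of (50)
        x_next = rng.choice(2, p=trans_row(x, theta_new))       # transition under the updated parameter
        if x_next == S_STAR:                                    # rule (51)
            z = np.zeros(2)
        else:
            z = z + likelihood_ratio(x, x_next, theta)
        theta, x = theta_new, x_next
    return theta, lam

theta_hat, lam_hat = run()
```

With $\eta = 1$ and $\gamma_t = 1/t$, the estimate `lam_hat` is a running average of the observed rewards, so it always stays between the smallest and the largest one-step reward.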

In the case of a 3-site VPN, we choose simple parametric strategies. For the site $i$, $i \in \{1, 2, 3\}$, we set:



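[Equation (52), whose expressions are illegible in the source, defines three functions of the state and of the parameters of the site $i$:]
the probability to choose the action $a_0$ for the chain $\{X^{(t)}\}_t$ ($j, k \in \{1, 2, 3\}$, $i \neq j$, $i \neq k$),
the probability to choose the action $a_1$ for the chain $\{X^{(t)}\}_t$,
the probability to choose the action $a_2$ for the chain $\{X^{(t)}\}_t$.
We note that:

[Relation (53), partly illegible in the source: one of these probabilities exceeds $0.5$ if and only if a quantity built from the two state components $x^{(t)}$ of the site is below the corresponding parameter $\theta$.]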



As a result, the parameters $\theta_i$, $i = 1, 2, 3$, can be interpreted as fuzzy bounds for the system, since they determine the
probability to choose the action $i$.
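To make the reading of a parameter as a fuzzy bound concrete, here is a minimal sketch in which the probability of choosing an action is a logistic function of the gap between the parameter and an aggregate of the state. The logistic shape, the sharpness `beta`, the aggregate `load` and the numerical values are assumptions made for illustration; they are not the belief functions used in the experiments.

```python
import math

def belief(load, theta, beta=1.0):
    """Probability of choosing an action; it exceeds 0.5 exactly when load < theta.

    The logistic shape and the sharpness beta are assumptions made for illustration.
    """
    return 1.0 / (1.0 + math.exp(-beta * (theta - load)))

theta = 10.0                       # hypothetical fuzzy bound
below = belief(4.0 + 3.0, theta)   # aggregate state 7, under the bound
above = belief(8.0 + 6.0, theta)   # aggregate state 14, over the bound
at_bound = belief(10.0, theta)     # exactly on the bound
```

The bound is fuzzy in the sense that the action is not switched abruptly at `theta`: its probability moves smoothly through 0.5 as the aggregate state crosses the parameter.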



Recall that the cost function for each site $i$ ($i = 1, 2, 3$) of the VPN1 is given by equation (54) [expression illegible in the source],
where $j, k \in \{1, 2, 3\}$, $j \neq i$, $i \neq k$, $j \neq k$.
Besides, we infer the analytical expression of $C_t(x^{(t)}, \theta(t))$ from (48) and (54).

Figure 18: Belief functions, or parametrized strategies.

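Finally, the iterative algorithm applied to the site $j$ ($j = 1, 2, 3$) of the VPN1 takes the form:

$$\theta_i^{\text{site } j}(t+1) = \theta_i^{\text{site } j}(t) + \gamma_t \Big( \nabla_{\theta_i^{\text{site } j}} C_t\big(x^{(t)}, \theta(t)\big) + \big( C_t\big(x^{(t)}, \theta(t)\big) - \lambda_t^{\text{site } j} \big)\, z_i^{\text{site } j}(t) \Big), \quad i = 1, 2, 3,$$
$$\lambda_{t+1}^{\text{site } j} = \lambda_t^{\text{site } j} + \eta\, \gamma_t \Big( C_t\big(x^{(t)}, \theta(t)\big) - \lambda_t^{\text{site } j} \Big), \qquad (55)$$

where $x^{(t)} = (x_1^{(t)}, x_2^{(t)})$ is a realization of the process $X_j^{(t)} = (X_{j,1}^{(t)}, X_{j,2}^{(t)})$ ($j = 1, 2, 3$), and

$$z_i^{\text{site } j}(t+1) = \begin{cases} 0, & \text{if } x^{(t+1)} = s^*, \\ z_i^{\text{site } j}(t) + L_{\theta_i(t)}\big(x^{(t)}, x^{(t+1)}\big), & \text{otherwise} \end{cases} \quad (i = 1, 2, 3). \qquad (56)$$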

The application of the algorithm to an MPLS network of 3 independent VPNs, with a stable and mono-path routing and
the possibility to introduce a central management, requires the estimation of $3 \times 12 = 36$ additional parameters. Locally, we
still have to cope with 3 independent systems of the form (55), and consequently we generate 3 samples per VPN at the
decision epoch $t$: $(x_1^{(t)}, x_2^{(t)}, x_3^{(t)}) \sim p_{\theta_t^1}(s, s')$, $s, s' \in S^1$, for the VPN1, $(y_1^{(t)}, y_2^{(t)}, y_3^{(t)}) \sim p_{\theta_t^2}(s, s')$, $s, s' \in S^2$,
for the VPN2, and finally $(z_1^{(t)}, z_2^{(t)}, z_3^{(t)}) \sim p_{\theta_t^3}(s, s')$, $s, s' \in S^3$, for the VPN3.
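The sampling step described above can be sketched as follows: at one decision epoch, each VPN draws one successor state per site from its own parametrized transition matrix, independently of the other VPNs. The 3-state matrices, the shared kernel across the sites of a VPN, and the names are hypothetical simplifications chosen for the example.

```python
import numpy as np

def sample_sites(P, current, rng):
    """Draw one successor state per site from the rows of P indexed by `current`."""
    return np.array([rng.choice(P.shape[1], p=P[s]) for s in current])

rng = np.random.default_rng(0)

# Hypothetical 3-state transition matrices, one per VPN (each row sums to 1).
kernels = {
    "VPN1": np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]]),
    "VPN2": np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]),
    "VPN3": np.array([[0.5, 0.4, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]),
}
# Current states of the 3 sites of each VPN (all sites start in state 0 here).
states = {name: np.zeros(3, dtype=int) for name in kernels}

# One decision epoch: the VPNs are independent, so each one is sampled separately.
samples = {name: sample_sites(P, states[name], rng) for name, P in kernels.items()}
```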

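We introduce global transition probabilities and rewards on the links:

$$p_\theta(l, l') = \sum_{d \in D} f_t(l, d, \theta)\, p(l' \mid l, d),$$
$$C_t(l, \theta) = \sum_{d \in D} f_t(l, d, \theta)\, C_t(l, d), \quad l, l' \in S^1 \cup S^2 \cup \dots \qquad (57)$$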

Each time one of the satisfaction levels is exceeded, we solve the global iterative algorithm:

$$\theta_{t+1}^{\text{link}} = \theta_t^{\text{link}} + \gamma_t \Big( \nabla_\theta C_t\big(R\,(x^{(t)}, y^{(t)}, z^{(t)})^T, \theta_t\big) + \big( C_t\big(R\,(x^{(t)}, y^{(t)}, z^{(t)})^T, \theta_t\big) - \lambda_t^{\text{link}} \big)\, z_t \Big),$$
$$\lambda_{t+1}^{\text{link}} = \lambda_t^{\text{link}} + \eta\, \gamma_t \Big( C_t\big(R\,(x^{(t)}, y^{(t)}, z^{(t)})^T, \theta_t\big) - \lambda_t^{\text{link}} \Big), \qquad (58)$$

with

$$z_{t+1} = \begin{cases} 0, & \text{if } R\,(x^{(t+1)}, y^{(t+1)}, z^{(t+1)})^T = R\,[(t_{LO}, 0), (t_{LO}, 0), (t_{LO}, 0)]^T, \\ z_t + L_{\theta_t}(l_t, l_{t+1}), & \text{otherwise.} \end{cases}$$

Figure 19: Parametric strategies on a single VPN. As an example, we consider once more the 3-site VPN,
with stable and mono-path routing (i.e. the unique path between each pair of nodes is the directed link
between these sites). The sites are independent of one another, consequently the algorithm applies independently.
The recurrent states are supposed to be of the form $(t_{LO}, 0)$ for each of the three sites.
The convergence of the average reward occurs in around 100 iterations.

6 Conclusion 

We have developed an original approach to tackle the problem of decision making under uncertainty. The choice of
optimizing a QoS criterion such as the delay is rather arbitrary, and the approach could be extended to various objective functions.
This article gives rules to control a VPN optimally so as to minimize the delay, under the assumption that the traffic
follows the worst possible evolution. We first determine a solution on a finite horizon [0; T], making extensive use of Bellman's
principle. Asymptotically, however, we would rather apply linear programming, since under such an assumption the strategies
can be assumed stationary. The case of the management of three VPNs is also studied, via the introduction of hierarchical
MDPs on the one hand, and stochastic games on the other hand. The use of the Cross-Entropy method enables us to
forecast the trajectories of our system, provided we are given an initial state or, at least, an initial distribution on the states.
A curious point is that the system evolves without any observation, since all the possible behaviors
must be predicted and kept in memory before the system enters its initial state. In fact, the system evolution is blind, and
completely disconnected from reality. An interesting idea would be to introduce observations, so as to adapt the evolution
of the system. The introduction of Partially Observed Markov Decision Processes (see [12], [13], [14], [15]) might also be
quite promising, but rather hard to put into practice due to the large cardinality of our state spaces.
Indeed, the curse of dimensionality appears as soon as we have to manage a complex network of more than one VPN.
The state space quickly becomes huge, and Bellman's principle gets quite difficult to apply. Fortunately,
techniques of simulation based optimization over the policy space ([18]) represent an alternative approach that we have
tested successfully. The idea is to introduce parametrized strategies that depend on a set of unknown parameters. A
simulation algorithm is then proposed for optimizing the average reward and, at the same time, the unknown parameters.



[Plot panels: Chain X12(t) and Chain X23(t); horizontal axis graduated from 10 to 100.]



Figure 20: Parametric strategies on a single VPN. The second picture represents the dynamic evolution of 
the sampled trajectories for each site of the VPN. 
Figure 21: Parametric strategies on a single VPN. Convergence of the parameters defining the parametric
strategies of the VPN1's sites.

From a practical point of view, this approach is all the more interesting since, to our knowledge, it has been tested
on only a few concrete case studies.
Figure 24: Parametric strategies on an MPLS network. Convergence of the average reward on the links.